Parallel Numerical Simulation
Lesson 4
Basics of Parallel Systems and Programs
Ioan Lucian Muntean, SCCS, Technische Universität München
St. Kliment Ohridski University of Bitola, Faculty of Technical Sciences
October 4, 2005
Contents
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements
Implementation: Target Architectures
- different target architectures for numerical simulations:
  - monoprocessors: many everyday simulation applications are designed to run on PCs or ordinary workstations; obtaining optimum efficiency requires knowledge of how modern microprocessors work
  - supercomputers: numerical simulations have always been the most important application of high-performance computers, as well as the driving force of supercomputer development; obtaining optimum efficiency requires architecture-based tuning
- computer development follows Moore's law: roughly a tenfold performance increase every 5 years
- the performance gap between mass-market computers and supercomputers stays nearly constant (factor > 100)
Modern Microprocessors
- obvious trends:
  - increasing clock rates (> 2 GHz almost standard)
  - more MIPS, more FLOPS
  - very-, ultra-, and ???-large-scale integration; hence, more transistors and more functionality on the chip
  - longer words: 64-bit architectures are standard (workstations, PCs)
- important features:
  - RISC (Reduced Instruction Set Computer) technology
- what has to be distinguished:
  - What is to be parallelized (code or data; competition)?
  - Where is parallelization done (programs / processes / machine instructions / microinstructions)?
  - Who parallelizes (manual or explicit / interactive / automatic or implicit)?
- topology of the system: arrangement of processors, structure of the network, static or dynamic topology
- synchronization: loose or tight coupling
- communication: implicit via shared memory or explicit via messages
Contents
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements
Interconnection Networks
- Access to remote data in a parallel computer requires communication among processors (or between processors and memories)
- Direct point-to-point connections among large numbers of processors (or memories) are infeasible:
  - O(p²) wires would be required (see the sketch below)
- Connections are provided only between selected pairs of processors (or memories):
  - routing through intermediate processors or switches is required for communication between other pairs
- The topology of the resulting sparsely connected network partly determines the latency and the aggregate bandwidth of communication
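To make the O(p²) figure concrete, a minimal sketch (the sample processor counts are illustrative):

    #include <stdio.h>

    /* Full point-to-point connectivity needs one wire per processor pair:
       p*(p-1)/2 wires, i.e., O(p^2). */
    int main(void) {
        long ps[] = {16, 256, 4096};
        for (int i = 0; i < 3; i++) {
            long p = ps[i];
            printf("p = %5ld -> %10ld wires\n", p, p * (p - 1) / 2);
        }
        return 0;
    }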
Topologies
- STATIC: fixed connections that do not vary during program execution
- DYNAMIC: configured at run time to match the communication demand during execution
- Parameters:
  - LATENCY
  - BANDWIDTH
  - HARDWARE COMPLEXITY
  - SCALABILITY
Static Topologies
[Figure: linear array, ring, and star topologies (nodes 0 ... N); binary tree and fat tree]
Static Topologies (cont.)
[Figure: hypercubes; mesh, Illiac mesh, and torus]
Dynamic Topologies
[Figure: crossbar switch network connecting processors P1 ... Pn to memories M1 ... Mm]
[Figure: 8x8 Omega network built from 2x2 switches, with interstage connecting pattern]
Switch counts: crossbar: n²; Omega network: (n/2) * log2(n)
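A minimal sketch contrasting the two switch counts (the helper names are mine, not from the slides; link with -lm):

    #include <math.h>
    #include <stdio.h>

    /* An n x n crossbar needs one crosspoint switch per (input, output)
       pair: n^2 switches. */
    long crossbar_switches(long n) { return n * n; }

    /* An n x n Omega network has log2(n) stages of n/2 two-by-two
       switches: (n/2) * log2(n) switches. */
    long omega_switches(long n) {
        return (n / 2) * (long)round(log2((double)n));
    }

    int main(void) {
        for (long n = 8; n <= 1024; n *= 4)
            printf("n = %4ld: crossbar %8ld, omega %6ld\n",
                   n, crossbar_switches(n), omega_switches(n));
        return 0;
    }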
Properties of Networks
- Model the network as a graph, with processors (or switches) as nodes and the wires between them as edges
- Degree: maximum number of edges incident on any node
- Diameter: largest number of edges in a shortest path between any pair of nodes
- Edge length: maximum physical length of any wire
Properties of Networks
Topology      Degree          Diameter             Num. Links
Linear        2               N-1                  N-1
Star          N-1             2                    N-1
Binary Tree   3               2(k-1), k = log N    N-1
Mesh          4               2(r-1), N = r*r      2(N-r)
Hypercube     k, k = log N    k                    k*N/2
Graph Embedding
- Mapping the task graph of a given problem to the network graph of a target computer is an instance of graph embedding:
  F: G1(V1, E1) → G2(V2, E2)
- Dilation: maximum distance in G2 between any two nodes F(x) and F(y) such that x and y are adjacent in G1
- Load: maximum number of nodes in V1 mapped onto any one node in V2
- Congestion: maximum number of edges in E1 mapped onto any one edge in E2
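A minimal sketch of these metrics on a toy mapping (example mine): embedding a 4-node ring into a 4-node linear array with the identity mapping, where shortest paths in the line are trivial to enumerate:

    #include <stdio.h>

    int main(void) {
        /* G1: 4-node ring; G2: 4-node linear array; F(i) = i */
        int ring_edges[4][2] = {{0,1},{1,2},{2,3},{3,0}};
        int dilation = 0, congestion = 0;
        int edge_use[3] = {0, 0, 0};   /* G2 edge i connects nodes i and i+1 */

        for (int e = 0; e < 4; e++) {
            int u = ring_edges[e][0], v = ring_edges[e][1];
            int lo = u < v ? u : v, hi = u > v ? u : v;
            int d = hi - lo;               /* distance |F(u) - F(v)| in the line */
            if (d > dilation) dilation = d;
            for (int i = lo; i < hi; i++)  /* shortest route uses line edges lo..hi-1 */
                edge_use[i]++;
        }
        for (int i = 0; i < 3; i++)
            if (edge_use[i] > congestion) congestion = edge_use[i];

        /* load is 1: F is a bijection */
        printf("dilation %d, load 1, congestion %d\n", dilation, congestion);
        return 0;                          /* prints: dilation 3, load 1, congestion 2 */
    }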
Graph Embedding
- Ideally, we want dilation, load, and congestion all to be 1, but such a perfect embedding is not always possible
- Determining the optimal embedding between two arbitrary graphs is a hard combinatorial problem (NP-complete), so heuristics are usually used to determine a "good" embedding
- Many particular cases that occur frequently in practice have known good, or even optimal, embeddings
Example: Graph Embedding
- A ring can be embedded perfectly in a 2-D mesh with the same number of nodes if and only if the mesh has an even number of rows or columns
- A complete binary tree with k levels can be embedded in a 2-D mesh with dilation (k-1)/2
- Many types of graphs can be embedded in hypercubes effectively, often perfectly
- A 2-D mesh or torus with 2^j x 2^k processors can be embedded perfectly in a hypercube with 2^(j+k) processors
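Such perfect hypercube embeddings typically rest on the binary-reflected Gray code: consecutive codes differ in exactly one bit, so a ring of 2^d nodes maps onto a d-cube with dilation 1. A minimal sketch:

    #include <stdio.h>

    /* Binary-reflected Gray code: gray(i) and gray(i+1) differ in one bit,
       so mapping ring node i to hypercube node gray(i) gives dilation 1. */
    unsigned gray(unsigned i) { return i ^ (i >> 1); }

    int main(void) {
        unsigned d = 3, n = 1u << d;            /* ring of 8 nodes, 3-cube */
        for (unsigned i = 0; i < n; i++) {
            unsigned a = gray(i), b = gray((i + 1) % n);
            /* a ^ b has exactly one bit set: a and b are cube neighbors */
            printf("ring %u -> cube %u (one-bit diff %u)\n", i, a, a ^ b);
        }
        return 0;
    }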
Practical Networks
- Hypercube networks for MPPs:
  - small diameter, high bandwidth, algorithmic elegance, and flexibility in accommodating many different applications
  - variable degree and edge length make the design and manufacturing of hypercube networks complicated
- Most MPPs today use 2-D or 3-D mesh networks:
  - constant degree and edge length; match well with grid-based applications
- An important conceptual paradigm for many parallel algorithms, such as collective communication operations
Contents
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements
Flynn's Classification (1972)
- SISD: Single Instruction, Single Data
  - the classical von Neumann monoprocessor
- SIMD: Single Instruction, Multiple Data
  - vector computers: extreme pipelining, one instruction applied to a sequence (vector) of data (CRAY 1, 2, X, Y, J/C/T90, ...)
Memory Access Classification
- NORMA: No Remote Memory Access
  - distributed-memory systems; clusters, IBM SP-2, iPSC/860
  - Advantage: scalability
  - Drawbacks: portability, programming model, and load distribution
[Figure: distributed-memory system: processors P, each with local memory LM, linked by an interconnecting network; clusters of such shared-memory nodes joined by a global interconnecting network with global shared memory]
Contents
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements
Top500.org (top 10, June 2005)
Source: www.top500.org
Top500.org (top 10, June 2005)
Source: www.top500.org
Top 500 - Countries
Results from the June 2005 edition of the Top 500 list.
Source: www.top500.org
Top 500 - Architectures Overview
Results from the June 2005 edition of the Top 500 list.
Source: www.top500.org
Top 500 - Architectures Overview
Results from the June 2005 edition of the Top 500 list.
Source: www.top500.org
Top 500 - Clusters Overview
Results from the June 2004 edition of the Top 500 list!
Source: www.top500.org
Top 500 - Processors Overview
Results from the June 2005 edition of the Top 500 list.
Source: www.top500.org
Top 500 - Processors Overview
Results from the June 2005 edition of the Top 500 list.
Source: www.top500.org
Contents
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements
Parallel Programming Paradigms
- Main paradigms for parallel programming, in increasing order of the detail the programmer must explicitly specify:
  - Functional languages
  - Parallelizing compilers
  - Object-oriented
  - Data parallel
  - Shared memory
  - Remote memory access
  - Message passing
Parallelizing Compilers
- Ideally, computers should be able to figure out parallelism for us: the "holy grail" of parallel programming would be a compiler that automatically parallelizes programs written in conventional sequential programming languages
- Like general AI, this has proved extraordinarily difficult for arbitrary serial code
- The usual practical approach is for the compiler to analyze serial loops for potential parallel execution, based on a careful dependence analysis of the variables occurring in each loop
- The user can usually provide hints (called directives) to help the compiler determine when loops can be parallelized and how; see the sketch below
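As a concrete instance, OpenMP is the standard directive-based approach for C and Fortran: the programmer asserts that loop iterations are independent, and the compiler generates the threaded code. A minimal sketch (compile with, e.g., gcc -fopenmp):

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* The directive tells the compiler these iterations carry no
           dependences and may run on different threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[N-1] = %f\n", c[N - 1]);
        return 0;
    }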
Message Passing
- In this model, a process has access only to its own local memory, so it must exchange messages with other processes to satisfy data dependences between processes
- Message passing is the most natural and efficient paradigm for distributed-memory systems
- Message passing can also be implemented efficiently on shared-memory and almost any other parallel architecture, so it is the most portable parallel programming paradigm
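MPI, the de facto message-passing standard, makes this explicit exchange concrete. A minimal sketch (run with two processes, e.g., mpicc followed by mpirun -np 2):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                 /* data lives in rank 0's local memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* rank 1 cannot read rank 0's memory; it must receive a message */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }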
Message Passing
[Figure: nodes, each a CPU with its own memory, connected by a message-passing network]
- Message passing fits well with the design philosophy of clusters
- offers great flexibility in exploiting data locality, tolerating latency, and other performance-enhancement techniques
Message Passing
- Debugging:
  - often easier with message passing
  - accidental overwriting of memory is less likely and much easier to detect than with shared-memory paradigms
- Programming with message passing:
  - sometimes criticized as tedious and low-level
  - tends to result in programs with good performance, scalability, and portability
- Naturally well suited to distributed memory
- The dominant paradigm for scalable applications on massively parallel systems
Contents
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements
Measures of Parallel Performance
- T1: serial execution time on one processor
- Tp: parallel execution time on p processors
- Speedup: Sp = T1/Tp
- Efficiency: Ep = T1/(p*Tp)
- Thus, Ep = Sp/p and Sp = p*Ep
- Pseudotheorem: Sp ≤ p and Ep ≤ 1
- But "speedup anomalies" can occur if resources (e.g., cache) increase as p increases, so that the effective computation rate increases
Amdahl’s Law
- Assumptions:
  - serial fraction s, 0 ≤ s ≤ 1
  - parallel fraction 1 - s
- Conclusions:
  - Tp = s*T1 + (1-s)*T1/p
  - Sp = p/(s*p + (1-s))
  - Ep = 1/(s*p + (1-s))
- Corollary: Sp → 1/s and Ep → 0 as p → ∞
- Amdahl's Law induced early pessimism about the potential of parallel computing (see the worked example below)
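A quick worked example (the numbers are mine): even with a serial fraction of only s = 0.05, speedup saturates near 1/s = 20, no matter how many processors are used:

    #include <stdio.h>

    /* Amdahl's law: Sp = p / (s*p + (1 - s)) */
    double speedup(double s, double p) { return p / (s * p + (1.0 - s)); }

    int main(void) {
        double s = 0.05;                       /* 5% serial fraction */
        for (double p = 10; p <= 10000; p *= 10)
            printf("p = %6.0f -> Sp = %6.2f\n", p, speedup(s, p));
        /* prints Sp of about 6.9, 16.8, 19.6, 19.96: the limit is 1/s = 20 */
        return 0;
    }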
Problem Scaling
- Amdahl's Law applies only to fixed-size problems, or when the serial fraction is independent of the problem size (rarely true in practice!)
- Larger computers are intended for solving larger problems, and the serial fraction usually decreases with problem size
- An algorithm is scalable if efficiency can be maintained at a constant value (or at least bounded away from zero) as the number of processors grows, by increasing the problem size
Pitfalls of Asymptotic Analysis
- Asymptotic analysis is often based on an unrealistic model of parallel computation
- Asymptotic estimates apply for large n and p, but may not be relevant for the actual values of interest
- Lower-order terms may be significant for n and p of practical interest
  - Example: if the complexity is 10*n + n*log n, the linear term is actually larger for n < 1024
- Proportionality constants may make an important practical difference
  - Example: a complexity of 10*n² is actually better than a complexity of 1000*n*log n for n < 996
Parallel Performance Modeling
- Tp: time elapsed from the start of execution on the first processor until the end of execution on the last processor
- Tcomp: serial execution time plus time for any additional computation due to parallel execution
- Tcomm: time spent sending and receiving messages
- Tidle: time due to lack of work to do or lack of necessary data (e.g., waiting for a message)

  Tp = Tcomp + Tcomm + Tidle
Communication Costs
- Time for sending a message is modeled by

  Tmsg = ts + tw*L

  where ts is the startup time, tw is the transfer time per word, and L is the message length in words
- Minimum latency is ts (for a zero-length message)
- Bandwidth of the communication channel is 1/tw
- Typically, ts is roughly two orders of magnitude larger than tw on most real parallel systems
- The startup term usually dominates the cost for small messages, the bandwidth term for large messages (see the sketch below)
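A minimal sketch of this model (the ts and tw values are illustrative, not measurements): the effective bandwidth L/Tmsg approaches 1/tw only for large messages:

    #include <stdio.h>

    int main(void) {
        double ts = 1e-5;   /* startup time: 10 microseconds (illustrative) */
        double tw = 1e-8;   /* per-word transfer time: 1e8 words/s (illustrative) */
        for (long L = 1; L <= 10000000; L *= 100) {
            double tmsg = ts + tw * L;          /* Tmsg = ts + tw*L */
            printf("L = %8ld words: Tmsg = %.2e s, eff. bw = %.2e words/s\n",
                   L, tmsg, L / tmsg);
        }
        /* for small L the startup term ts dominates; for large L the
           effective bandwidth approaches 1/tw = 1e8 words/s */
        return 0;
    }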
Idle Time
- Idle time due to lack of work:
  - reduced by improving the load balance
- Idle time due to lack of data:
  - reduced by overlapping computation and communication
Summary
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements