
1

Parallel Numerical Simulation

Lesson 4

Basics of Parallel Systems and Programs

Ioan Lucian Muntean, SCCS, Technische Universität München

St. Kliment Ohridski University of Bitola, Faculty of Technical Sciences

October 4, 2005


2

Contents

- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements


3

Implementation: Target Architectures

- different target architectures for numerical simulations:
  - monoprocessors: many everyday simulation applications are designed to run on PCs or ordinary workstations; obtaining optimum efficiency requires knowledge of how modern microprocessors work
  - supercomputers: numerical simulations have always been the most important application of high-performance computers, as well as the driving force of supercomputer development; obtaining optimum efficiency requires architecture-based tuning
- computer development follows Moore's law: a performance increase by a factor of 10 every 5 years
- the performance gap between mass-market computers and supercomputers stays nearly constant (a factor of more than 100)


4

Modern Microprocessors

- obvious trends:
  - increasing clock rates (> 2 GHz almost standard)
  - more MIPS, more FLOPS
  - very-, ultra-, and ???-large scale integration; hence, more transistors and more functionality on the chip
  - longer words: 64-bit architectures are standard (workstations, PCs)
- important features:
  - RISC (Reduced Instruction Set Computer) technology
  - well-developed pipelining
  - superscalar processor organization
  - caching and multi-level memory hierarchy
  - VLIW, multi-threaded architectures, on-chip multiprocessors, ...


5

RISC Technology, Pipelining

- RISC technology:
  - counter-trend to CISC, where more and more complex instructions entailed microprogramming
  - now instead:
    - relatively small number of instructions (tens)
    - simple machine instructions, fixed format, few addressing modes
    - one cycle per instruction
    - load/store principle: only explicit LOAD/STORE instructions access memory
    - no more need for microprogramming
- pipelining:
  - decompose instructions into simple steps involving different parts of the CPU: load, decode, reserve registers, execute, write results (Alpha 21164, FP-DIV double precision: 61 clocks)


6

Pipelining, Superscalar Processors

- pipelining (continued):
  - further improvement: reorder the steps of an instruction (LOAD as early as possible, WRITE as late as possible) to avoid the risk of idle waiting time
  - best case: identical instructions to be pipelined/overlapped, as in vector processors
  - pipelining needs different functional units in the CPU that can deal with the different steps in parallel; therefore:
- superscalar processor organization:
  - several parts of the CPU are available in more than one copy


7

Cache Memory

- cache memory:
  - aim: reduce memory access time/latency (CPU performance has increased faster than memory access speed)
  - cache memory: small, fast on-chip memory that keeps copies of parts of main memory
  - optimum: the needed data are always available in the cache
  - with cache access time t_c, main-memory access time t_m, and hit probability p, the effective access time is

        t_eff = p * t_c + (1 - p) * t_m

    (a small numeric illustration follows below)
- look for strategies that keep p close to 1:
  - what should be kept in the cache?
  - ensure locality of data (instructions in the cache need data in the cache)
  - strategies for fetching, replacement, and updating
  - associativity: how to check whether data are available in the cache?
  - consistency: no diverging versions in cache and main memory
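To make the formula concrete, here is a minimal C sketch; the 2 ns cache and 100 ns main-memory latencies are assumed, illustrative values, not figures from the lecture:

    #include <stdio.h>

    int main(void) {
        const double t_cache = 2.0;   /* cache access time in ns (assumed)       */
        const double t_main  = 100.0; /* main-memory access time in ns (assumed) */

        /* t_eff = p * t_c + (1 - p) * t_m for a few hit probabilities p */
        for (int i = 0; i <= 5; i++) {
            double p = 0.90 + 0.02 * i;
            double t_eff = p * t_cache + (1.0 - p) * t_main;
            printf("hit probability %.2f -> effective access time %5.1f ns\n", p, t_eff);
        }
        return 0;
    }

Even a few percent of misses dominate the effective latency, which is why the locality and replacement strategies listed above matter.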


8

Memory Hierarchy

- today: several cache levels
  - SGI Power Challenge: 32 kB on-chip primary cache, up to 16 MB off-chip secondary cache; sometimes also a level-3 cache
- together they form the memory hierarchy: registers, (level-1/2/3) cache, main memory, hard disk, remote memory: the faster, the smaller
- knowledge of the target computer's memory hierarchy is important for the efficiency of numerical algorithms:
  - example: matrix-vector product Ax with A too large for the cache

        A[m,n], X[n], Y[m] and Y = A*X
        for i = 1 to m do
        begin
          Y[i] = 0;
          for j = 1 to n do
            Y[i] = Y[i] + A[i,j]*X[j];
        end;

- tuning is crucial: peak performance can be up to 4 orders of magnitude higher than the performance observed in practice without tuning (see the C sketch below)
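For illustration, a minimal C version of the same product (row-major storage is an assumption of this sketch): the inner loop walks one row of A contiguously, which is the cache-friendly access order for that layout, while swapping the loops would stride across rows and lose spatial locality.

    #include <stddef.h>

    /* y = A*x with A stored row-major in a[i*n + j]; the inner loop
       touches a with stride 1 and reuses x across all rows. */
    void matvec(int m, int n, const double *a, const double *x, double *y)
    {
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += a[(size_t)i * n + j] * x[j];
            y[i] = sum;
        }
    }

Blocking (tiling) the loops goes one step further and is one example of the architecture-based tuning mentioned above.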


9

Parallelization, Parallelism, Parallel Computers

- parallel computers vs. distributed systems: where is the frontier?
- what has to be distinguished:
  - What is to be parallelized (code or data; competition)?
  - Where is parallelization done (programs / processes / machine instructions / microinstructions)?
  - Who parallelizes (manual or explicit / interactive / automatic or implicit)?
- topology of the system: arrangement of processors, structure of the network, static or dynamic topology
- synchronization: loose or tight coupling
- communication: implicitly via shared memory or explicitly via messages



11

Interconnection Networks

- Access to remote data in a parallel computer requires communication among processors (or between processors and memories)
- Direct point-to-point connections among large numbers of processors (or memories) are infeasible
  - O(p^2) wires would be required
- Connections are provided only between selected pairs of processors (or memories)
  - routing through intermediate processors or switches is required for communication between other pairs
- The topology of the resulting sparsely connected network partly determines the latency and the aggregate bandwidth of communication


12

Topologies

- STATIC: fixed connections, which do not vary during program execution
- DYNAMIC: configured dynamically to match the communication demand during execution
- Parameters:
  - LATENCY
  - BANDWIDTH
  - HARDWARE COMPLEXITY
  - SCALABILITY


13

Static Topologies

[Figure: node diagrams of static topologies with nodes 0..N: linear array, ring, star, binary tree, fat tree]


14

Static Topologies

[Figure: hypercubes of increasing dimension, 2-D mesh, Illiac mesh, torus]


15

Dynamic Topologies

[Figure: a crossbar switch network connecting processors P1...Pn to memory modules M1...Mm, and an 8x8 Omega network built from 2x2 switches, with its interstage connection pattern]

- number of switches: crossbar: n^2; Omega network: (n/2) * log2(n)


16

Properties of Networks

- Model the network as a graph with processors (switches) as nodes and wires between them as edges
- Degree: maximum number of edges incident on any node
- Diameter: largest number of edges in a shortest path between any pair of nodes
- Edge length: maximum physical length of any wire


17

Properties of Networks

Topology        Degree   Diameter               Num. Links
Linear array    2        N-1                    N-1
Star            N-1      2                      N-1
Binary tree     3        2(k-1), k = log2(N)    N-1
Mesh (r x r)    4        2(r-1), N = r*r        2(N-r)
Hypercube       k        k, k = log2(N)         k*N/2


18

Graph Embedding

- Mapping the task graph for a given problem to the network graph of a target computer is an instance of a graph embedding

      F: G1(V1, E1) ⇒ G2(V2, E2)

- Dilation: maximum distance between any two nodes F(x) and F(y) in G2 such that x and y are adjacent in G1
- Load: maximum number of nodes in V1 mapped onto any one node in V2
- Congestion: maximum number of edges in E1 mapped onto any one edge in E2


19

Graph Embedding

- Ideally, we want dilation, load, and congestion all to be 1, but such a perfect embedding is not always possible
- Determining the optimal embedding between two arbitrary graphs is a hard combinatorial problem (NP-complete), so heuristics are usually used to determine a "good" embedding
- For many particular cases that occur frequently in practice, good, or even optimal, embeddings are known


20

Example: Graph Embedding

- A ring can be embedded perfectly in a 2-D mesh with the same number of nodes if and only if the mesh has an even number of rows or columns
- A complete binary tree with k levels can be embedded in a 2-D mesh with dilation (k-1)/2
- Many types of graphs can be embedded in hypercubes effectively, often perfectly (a small sketch follows below)
- A 2-D mesh or torus with 2^j x 2^k processors can be embedded perfectly in a hypercube with 2^(j+k) processors
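As one concrete instance of a perfect embedding, the following minimal C sketch maps a ring of 2^d nodes onto a d-dimensional hypercube via the binary-reflected Gray code and reports the dilation; the construction is a standard textbook example, and the helper names and the choice d = 4 are my own:

    #include <stdio.h>

    /* Binary-reflected Gray code: consecutive values differ in exactly one bit. */
    static unsigned gray(unsigned i) { return i ^ (i >> 1); }

    /* Hypercube distance between two node addresses = number of differing bits. */
    static int cube_dist(unsigned a, unsigned b) {
        int d = 0;
        for (unsigned x = a ^ b; x != 0; x >>= 1)
            d += (int)(x & 1u);
        return d;
    }

    int main(void) {
        const int d = 4;             /* hypercube dimension (assumed)          */
        const unsigned n = 1u << d;  /* ring size = number of hypercube nodes  */
        int dilation = 0;

        /* Ring edge (i, i+1 mod n) is mapped to (gray(i), gray((i+1) mod n)). */
        for (unsigned i = 0; i < n; i++) {
            int dist = cube_dist(gray(i), gray((i + 1) % n));
            if (dist > dilation)
                dilation = dist;
        }
        /* gray() is a bijection on 0..n-1, so the load is 1 by construction. */
        printf("ring of %u nodes in a %d-cube: dilation = %d, load = 1\n", n, d, dilation);
        return 0;
    }

It reports dilation 1: adjacent ring nodes land on adjacent hypercube nodes, which is exactly the "often perfectly" case mentioned above.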


21

Practical Networks

- For MPPs: hypercube networks
  - small diameter, high bandwidth, algorithmic elegance, and flexibility in accommodating many different applications
  - variable degree and edge length make the design and manufacturing of hypercube networks complicated
- Most MPPs today use 2-D or 3-D mesh networks
  - constant degree and edge length; they match well with grid-based applications
- The hypercube remains an important conceptual paradigm for many parallel algorithms, such as collective communication operations



23

Flynn's Classification (1972)

- SISD: Single Instruction, Single Data
  - classical von Neumann monoprocessor
- SIMD: Single Instruction, Multiple Data
  - vector computers: extreme pipelining, one instruction applied to a sequence (vector) of data (CRAY 1, 2, X, Y, J/C/T90, ...)
  - array computers: array of processors, concurrency (Thinking Machines CM-2, MasPar MP-1, MP-2)
- MIMD: Multiple Instruction, Multiple Data
  - multiprocessors: distributed memory (loose coupling, explicit communication; Intel Paragon, IBM SP-2)
  - shared memory (tight coupling, global address space, implicit communication; most workstation servers)
  - nets/clusters
- MISD: Multiple Instruction, Single Data: rare


24

Flynn's Classification (1972)

[Figure: block diagrams of the four classes, showing control units, processing units/elements, memories, and the instruction (IS) and data (DS) streams connecting them: SISD (one control unit, one processing unit, one memory unit), SIMD (one control unit driving a processor array of processing elements with local memories), MISD, and MIMD (multiple control/processing unit pairs with shared memory)]


25

Memory Access Classification

- UMA: Uniform Memory Access
  - shared-memory systems: SMPs (symmetric multiprocessors, parallel vector processors); PC and workstation servers, CRAY Y-MP
  - Advantages: portability, programming model, and load distribution
  - Drawback: scalability

[Figure: processors P connected through an interconnection network to shared memory modules]


26

Memory Access Classification

- NUMA: Non-Uniform Memory Access
  - systems with virtually shared memory; KSR-1, CRAY T3D/T3E, CONVEX SPP
  - Advantages: portability, programming model, scalability
  - Drawbacks: cache coherence, communication

[Figure: processors P, each with a shared local memory, connected by an interconnection network]


27

Memory Access Classification

- NORMA: No Remote Memory Access
  - distributed-memory systems; clusters, IBM SP-2, iPSC/860
  - Advantage: scalability
  - Drawbacks: portability, programming model, and load distribution

[Figure: clusters, each with processors P and local memories LM joined by a cluster interconnection network and shared memory, connected by a global interconnection network with global shared memory]



29

Top500.org (top 10, June 2005)

Source: www.top500.org


30

Top500.org (top 10, June 2005)

Source: www.top500.org


31

Top 500 - Countries

Results of the last edition of Top 500 – June 2005.

Source: www.top500.org


32

Top 500 - Architectures Overview

Results of the last edition of Top 500 – June 2005.

Source: www.top500.org


33

Top 500 - Architectures Overview

Results of the last edition of Top 500 – June 2005.

Source: www.top500.org


34

Top 500 - Clusters Overview

Results of the last edition of Top 500 – June 2004!

Source: www.top500.org


35

Top 500 - Processors Overview

Results of the last edition of Top 500 – June 2005.

Source: www.top500.org


36

Top 500 - Processors Overview

Results of the last edition of Top 500 – June 2005.

Source: www.top500.org



38

Parallel Programming Paradigms

- Main paradigms for parallel programming, in increasing order of the detail the programmer must explicitly specify:
  - Functional languages
  - Parallelizing compilers
  - Object oriented
  - Data parallel
  - Shared memory
  - Remote memory access
  - Message passing


39

Parallelizing Compilers

- Ideally, computers should be able to figure out parallelism for us: the "holy grail" of parallel programming would be for the compiler to automatically parallelize programs written in conventional sequential programming languages
- Like general AI, this has proved extraordinarily difficult for arbitrary serial code
- The usual practical approach is for the compiler to analyze serial loops for potential parallel execution, based on careful dependence analysis of the variables occurring in the loop
- The user can usually provide hints (called directives) to help the compiler determine when and how loops can be parallelized (a short example follows below)
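As one widely used directive system (OpenMP; the slides do not name a specific one, so treat this choice as an assumption), a single pragma tells the compiler that the loop iterations are independent and may be executed in parallel:

    #include <stdio.h>

    int main(void) {
        enum { N = 1000000 };
        static double x[N], y[N];
        const double a = 2.5;

        /* The directive is a hint/assertion that the iterations are independent,
           so the compiler and runtime may distribute them over threads.
           Compile with OpenMP enabled (e.g., -fopenmp); otherwise the pragma is
           ignored and the loop simply runs serially. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("updated %d elements\n", N);
        return 0;
    }

The "hint" character of directives shows here: the serial semantics are unchanged, and the compiler decides whether and how to parallelize.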


40

Message Passing

- Message passing provides two-sided communication, send and receive, between processes
- In this model, a process has access only to its own local memory, so it must exchange messages with other processes to satisfy data dependences between processes
- Message passing is the most natural and efficient paradigm for distributed-memory systems
- Message passing can also be implemented efficiently on shared-memory and almost any other parallel architecture, so it is the most portable parallel programming paradigm (a minimal example follows below)
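A minimal sketch of two-sided communication using MPI (the slide does not name a library, so MPI is an assumed but typical choice): process 0 sends one integer to process 1, which receives it.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* two-sided: this send must be matched by a receive on rank 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes (e.g., mpirun -np 2 ./a.out); each process owns only its local value, so the data dependence is satisfied purely by the message.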


41

Message Passing

[Figure: nodes, each with its own CPU and memory, connected by a message-passing network]

- Message passing fits well with the design philosophy of clusters
- It offers great flexibility in exploiting data locality, tolerating latency, and applying other performance-enhancement techniques


42

Message Passing

- Debugging
  - often easier with message passing
  - accidental overwriting of memory is less likely and much easier to detect than with shared-memory paradigms
- Programming with message passing
  - sometimes criticized as being tedious and low-level
  - tends to result in programs with good performance, scalability, and portability
- Naturally well suited to distributed memory
- Dominant paradigm for scalable applications on massively parallel systems



44

Measures of Parallel Performance

- T1: serial execution time on one processor
- Tp: parallel execution time on p processors
- Speedup: Sp = T1/Tp
- Efficiency: Ep = T1/(p*Tp)
- Thus, Ep = Sp/p and Sp = p*Ep
- Pseudotheorem: Sp ≤ p and Ep ≤ 1
- But "speedup anomalies" can occur if resources (e.g., cache) increase as p increases, so that the effective computation rate increases


45

Amdahl’s Law

- Assumptions
  - serial fraction = s, 0 ≤ s ≤ 1
  - parallel fraction = 1 - s
- Conclusions
  - Tp = s*T1 + (1-s)*T1/p
  - Sp = p/(s*p + (1-s))
  - Ep = 1/(s*p + (1-s))
- Corollary:
  - Sp → 1/s and Ep → 0 as p → ∞
- Amdahl's Law induced early pessimism about the potential of parallel computing (the short program below tabulates the bound)
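A short program that tabulates the bound (the serial fraction s = 0.05 and the processor counts are assumed, illustrative values):

    #include <stdio.h>

    int main(void) {
        const double s = 0.05;  /* assumed serial fraction */
        const int procs[] = {1, 2, 4, 8, 16, 64, 256, 1024};
        const int k = (int)(sizeof procs / sizeof procs[0]);

        printf("s = %.2f, so the speedup is bounded by 1/s = %.1f\n", s, 1.0 / s);
        for (int i = 0; i < k; i++) {
            int p = procs[i];
            double Sp = p / (s * p + (1.0 - s));  /* Amdahl speedup */
            double Ep = Sp / p;                   /* efficiency     */
            printf("p = %4d: Sp = %6.2f, Ep = %.3f\n", p, Sp, Ep);
        }
        return 0;
    }

With s = 0.05 the speedup saturates near 20 regardless of p, while the efficiency drops toward zero, which is exactly the corollary stated above.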


46

Problem Scaling

- Amdahl's Law holds only for fixed-size problems, or when the serial fraction is independent of the problem size (rarely true in practice!)
- Larger computers are used to solve larger problems, and the serial fraction usually decreases with problem size
- An algorithm is scalable if its efficiency can be maintained at a constant value (or at least bounded away from zero) as the number of processors grows, by increasing the problem size


47

Pitfalls of Asymptotic Analysis

- Asymptotic analysis is often based on an unrealistic model of parallel computation
- Asymptotic estimates apply for large n and p, but may not be relevant for the actual values of interest
- Lower-order terms may be significant for n and p of practical interest
  - Example: if the complexity is 10*n + n*log2(n), the linear term is actually larger for n < 1024
- Proportionality constants may make an important practical difference
  - Example: a complexity of 10*n^2 is actually better than a complexity of 1000*n*log2(n) for n < 996 (see the check below)
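A throwaway check of the two crossover points quoted above (reading "log" as the base-2 logarithm is my assumption):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Find where 10*n stops exceeding n*log2(n), and where
           10*n^2 stops beating 1000*n*log2(n). */
        long n1 = 2, n2 = 2;
        while (10.0 * n1 > n1 * log2((double)n1)) n1++;
        while (10.0 * n2 * n2 < 1000.0 * n2 * log2((double)n2)) n2++;
        printf("10*n <= n*log2(n) first at n = %ld\n", n1);
        printf("10*n^2 >= 1000*n*log2(n) first at n = %ld\n", n2);
        return 0;
    }

Compiled with -lm, it should print 1024 and 996, matching the figures quoted above.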


48

Parallel Performance Modeling

- Tp: time elapsed from the start of the execution on the first processor until the end of the execution on the last processor
- Tcomp: serial execution time + time for any additional computation due to parallel execution
- Tcomm: time spent sending and receiving messages
- Tidle: due to lack of work to do or to lack of necessary data (e.g., waiting for a message)

      Tp = Tcomp + Tcomm + Tidle


49

Communication Costs

- Time for sending a message is modeled by

      Tmsg = ts + tw*L

  where ts is the startup time, tw is the transfer time per word, and L is the message length in words
- Minimum latency is ts (for a zero-length message)
- Bandwidth of the communication channel is 1/tw
- Typically, ts is roughly two orders of magnitude larger than tw for most real parallel systems
- The start-up term usually dominates the cost for small messages, the bandwidth term for large messages (see the sketch below)
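A minimal sketch of the model with assumed parameters (ts = 10 microseconds and tw = 0.1 microseconds per word are illustrative, not measurements from the lecture), showing where the start-up term stops dominating:

    #include <stdio.h>

    int main(void) {
        const double ts = 10.0;  /* startup time in microseconds (assumed)           */
        const double tw = 0.1;   /* transfer time per word in microseconds (assumed) */
        const long lengths[] = {1, 10, 100, 1000, 10000, 100000};
        const int k = (int)(sizeof lengths / sizeof lengths[0]);

        /* Tmsg = ts + tw*L, and the share of the cost due to startup */
        for (int i = 0; i < k; i++) {
            long L = lengths[i];
            double t = ts + tw * (double)L;
            printf("L = %6ld words: Tmsg = %9.1f us, startup share = %5.1f%%\n",
                   L, t, 100.0 * ts / t);
        }
        return 0;
    }

Below roughly L = ts/tw = 100 words the start-up term dominates; above it the bandwidth term takes over, matching the rule of thumb above.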


50

Idle Time

- Idle time due to lack of work
  - reduced by improving the load balance
- Idle time due to lack of data
  - reduced by overlapping computation and communication


51

Summary

- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements


52

Basics of Parallel Systems and Programs

Thank you for your attention!

Q&A