Parallel Numerical Simulation
Lesson 4
Basics of Parallel Systems and Programs
Ioan Lucian Muntean, SCCS, Technische Universität München
St. Kliment Ohridski University of Bitola, Faculty of Technical Sciences
October 4, 2005
Contents
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements
Implementation: Target Architectures
- different target architectures for numerical simulations:
  - monoprocessors: many everyday simulation applications are designed to run on PCs or ordinary workstations; obtaining optimum efficiency requires knowledge of how modern microprocessors work
  - supercomputers: numerical simulations have always been the most important application of high-performance computers, as well as the driving force of supercomputer development; obtaining optimum efficiency requires architecture-based tuning
- computer development follows Moore's law: roughly a tenfold performance increase every 5 years
- the performance gap between mass-market computers and supercomputers stays nearly constant (factor > 100)
Modern Microprocessors
- obvious trends:
  - increasing clock rates (> 2 GHz almost standard)
  - more MIPS, more FLOPS
  - very-, ultra-, and ???-large-scale integration; hence, more transistors and more functionality on the chip
  - longer words: 64-bit architectures are standard (workstations, PCs)
- important features:
  - RISC (Reduced Instruction Set Computer) technology
- what has to be distinguished:
  - What is to be parallelized (code or data; competition)?
  - Where is parallelization done (programs / processes / machine instructions / microinstructions)?
  - Who parallelizes (manual or explicit / interactive / automatic or implicit)?
- topology of the system: arrangement of processors, structure of the network, static or dynamic topology
- synchronization: loose or tight coupling
- communication: implicit via shared memory or explicit via messages
Contents
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements
Interconnection Networks
- Access to remote data in a parallel computer requires communication among processors (or between processors and memories)
- Direct point-to-point connections among large numbers of processors (or memories) are infeasible:
  - O(p²) wires would be required (see the sketch below)
- Connections are provided only between selected pairs of processors (or memories):
  - routing through intermediate processors or switches is required for communication between other pairs
- The topology of the resulting sparsely connected network partly determines the latency and the aggregate bandwidth of communication
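To make the O(p²) figure concrete, a minimal sketch (the sample processor counts are illustrative):

    #include <stdio.h>

    /* Full point-to-point connectivity needs one wire per processor pair:
       p*(p-1)/2 wires, i.e., O(p^2). */
    int main(void) {
        long ps[] = {16, 256, 4096};
        for (int i = 0; i < 3; i++) {
            long p = ps[i];
            printf("p = %5ld -> %10ld wires\n", p, p * (p - 1) / 2);
        }
        return 0;
    }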
Topologies
- STATIC: fixed connections that do not vary during program execution
- DYNAMIC: configured at run time to match the communication demand during execution
- Parameters:
  - LATENCY
  - BANDWIDTH
  - HARDWARE COMPLEXITY
  - SCALABILITY
Static Topologies
[Figure: linear array, ring, and star topologies (nodes 0 ... N); binary tree and fat tree]
Static Topologies (cont.)
[Figure: hypercubes; mesh, Illiac mesh, and torus]
Dynamic Topologies
[Figure: crossbar switch network connecting processors P1 ... Pn to memories M1 ... Mm]
[Figure: 8x8 Omega network built from 2x2 switches, with interstage connecting pattern]
Switch counts: crossbar: n²; Omega network: (n/2) * log2(n)
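A minimal sketch contrasting the two switch counts (the helper names are mine, not from the slides; link with -lm):

    #include <math.h>
    #include <stdio.h>

    /* An n x n crossbar needs one crosspoint switch per (input, output)
       pair: n^2 switches. */
    long crossbar_switches(long n) { return n * n; }

    /* An n x n Omega network has log2(n) stages of n/2 two-by-two
       switches: (n/2) * log2(n) switches. */
    long omega_switches(long n) {
        return (n / 2) * (long)round(log2((double)n));
    }

    int main(void) {
        for (long n = 8; n <= 1024; n *= 4)
            printf("n = %4ld: crossbar %8ld, omega %6ld\n",
                   n, crossbar_switches(n), omega_switches(n));
        return 0;
    }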
Properties of Networks
- Model the network as a graph, with processors (or switches) as nodes and the wires between them as edges
- Degree: maximum number of edges incident on any node
- Diameter: largest number of edges in a shortest path between any pair of nodes
- Edge length: maximum physical length of any wire
Properties of Networks
Topology      Degree          Diameter             Num. Links
Linear        2               N-1                  N-1
Star          N-1             2                    N-1
Binary Tree   3               2(k-1), k = log N    N-1
Mesh          4               2(r-1), N = r*r      2(N-r)
Hypercube     k, k = log N    k                    k*N/2
Graph Embedding
- Mapping the task graph of a given problem to the network graph of a target computer is an instance of graph embedding:
  F: G1(V1, E1) → G2(V2, E2)
- Dilation: maximum distance in G2 between any two nodes F(x) and F(y) such that x and y are adjacent in G1
- Load: maximum number of nodes in V1 mapped onto any one node in V2
- Congestion: maximum number of edges in E1 mapped onto any one edge in E2
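A minimal sketch of these metrics on a toy mapping (example mine): embedding a 4-node ring into a 4-node linear array with the identity mapping, where shortest paths in the line are trivial to enumerate:

    #include <stdio.h>

    int main(void) {
        /* G1: 4-node ring; G2: 4-node linear array; F(i) = i */
        int ring_edges[4][2] = {{0,1},{1,2},{2,3},{3,0}};
        int dilation = 0, congestion = 0;
        int edge_use[3] = {0, 0, 0};   /* G2 edge i connects nodes i and i+1 */

        for (int e = 0; e < 4; e++) {
            int u = ring_edges[e][0], v = ring_edges[e][1];
            int lo = u < v ? u : v, hi = u > v ? u : v;
            int d = hi - lo;               /* distance |F(u) - F(v)| in the line */
            if (d > dilation) dilation = d;
            for (int i = lo; i < hi; i++)  /* shortest route uses line edges lo..hi-1 */
                edge_use[i]++;
        }
        for (int i = 0; i < 3; i++)
            if (edge_use[i] > congestion) congestion = edge_use[i];

        /* load is 1: F is a bijection */
        printf("dilation %d, load 1, congestion %d\n", dilation, congestion);
        return 0;                          /* prints: dilation 3, load 1, congestion 2 */
    }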
Graph Embedding
- Ideally, we want dilation, load, and congestion all to be 1, but such a perfect embedding is not always possible
- Determining the optimal embedding between two arbitrary graphs is a hard combinatorial problem (NP-complete), so heuristics are usually used to determine a "good" embedding
- Many particular cases that occur frequently in practice have known good, or even optimal, embeddings
Example: Graph Embedding
- A ring can be embedded perfectly in a 2-D mesh with the same number of nodes if and only if the mesh has an even number of rows or columns
- A complete binary tree with k levels can be embedded in a 2-D mesh with dilation (k-1)/2
- Many types of graphs can be embedded in hypercubes effectively, often perfectly
- A 2-D mesh or torus with 2^j x 2^k processors can be embedded perfectly in a hypercube with 2^(j+k) processors
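Such perfect hypercube embeddings typically rest on the binary-reflected Gray code: consecutive codes differ in exactly one bit, so a ring of 2^d nodes maps onto a d-cube with dilation 1. A minimal sketch:

    #include <stdio.h>

    /* Binary-reflected Gray code: gray(i) and gray(i+1) differ in one bit,
       so mapping ring node i to hypercube node gray(i) gives dilation 1. */
    unsigned gray(unsigned i) { return i ^ (i >> 1); }

    int main(void) {
        unsigned d = 3, n = 1u << d;            /* ring of 8 nodes, 3-cube */
        for (unsigned i = 0; i < n; i++) {
            unsigned a = gray(i), b = gray((i + 1) % n);
            /* a ^ b has exactly one bit set: a and b are cube neighbors */
            printf("ring %u -> cube %u (one-bit diff %u)\n", i, a, a ^ b);
        }
        return 0;
    }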
Practical Networks
- Hypercube networks for MPPs:
  - small diameter, high bandwidth, algorithmic elegance, and flexibility in accommodating many different applications
  - variable degree and edge length make the design and manufacturing of hypercube networks complicated
- Most MPPs today use 2-D or 3-D mesh networks:
  - constant degree and edge length; match well with grid-based applications
- An important conceptual paradigm for many parallel algorithms, such as collective communication operations
Contents
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements
Flynn's Classification (1972)
- SISD: Single Instruction, Single Data
  - the classical von Neumann monoprocessor
- SIMD: Single Instruction, Multiple Data
  - vector computers: extreme pipelining, one instruction applied to a sequence (vector) of data (CRAY 1, 2, X, Y, J/C/T90, ...)
Memory Access Classification
- NORMA: No Remote Memory Access
  - distributed-memory systems; clusters, IBM SP-2, iPSC/860
  - Advantage: scalability
  - Drawbacks: portability, programming model, and load distribution
[Figure: distributed-memory system: processors P, each with local memory LM, linked by an interconnecting network; clusters of such shared-memory nodes joined by a global interconnecting network with global shared memory]
Contents
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements
Top500.org (top 10, June 2005)
Source: www.top500.org
Top500.org (top 10, June 2005)
Source: www.top500.org
Top 500 - Countries
Results from the June 2005 edition of the Top 500 list.
Source: www.top500.org
Top 500 - Architectures Overview
Results from the June 2005 edition of the Top 500 list.
Source: www.top500.org
Top 500 - Architectures Overview
Results from the June 2005 edition of the Top 500 list.
Source: www.top500.org
Top 500 - Clusters Overview
Results from the June 2004 edition of the Top 500 list!
Source: www.top500.org
Top 500 - Processors Overview
Results from the June 2005 edition of the Top 500 list.
Source: www.top500.org
Top 500 - Processors Overview
Results from the June 2005 edition of the Top 500 list.
Source: www.top500.org
Contents
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements
Parallel Programming Paradigms
- Main paradigms for parallel programming, in increasing order of the detail the programmer must explicitly specify:
  - Functional languages
  - Parallelizing compilers
  - Object-oriented
  - Data parallel
  - Shared memory
  - Remote memory access
  - Message passing
Parallelizing Compilers
- Ideally, computers should be able to figure out parallelism for us: the "holy grail" of parallel programming would be a compiler that automatically parallelizes programs written in conventional sequential programming languages
- Like general AI, this has proved extraordinarily difficult for arbitrary serial code
- The usual practical approach is for the compiler to analyze serial loops for potential parallel execution, based on a careful dependence analysis of the variables occurring in each loop
- The user can usually provide hints (called directives) to help the compiler determine when loops can be parallelized and how; see the sketch below
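As a concrete instance, OpenMP is the standard directive-based approach for C and Fortran: the programmer asserts that loop iterations are independent, and the compiler generates the threaded code. A minimal sketch (compile with, e.g., gcc -fopenmp):

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* The directive tells the compiler these iterations carry no
           dependences and may run on different threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[N-1] = %f\n", c[N - 1]);
        return 0;
    }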
Message Passing
- In this model, a process has access only to its own local memory, so it must exchange messages with other processes to satisfy data dependences between processes
- Message passing is the most natural and efficient paradigm for distributed-memory systems
- Message passing can also be implemented efficiently on shared-memory and almost any other parallel architecture, so it is the most portable parallel programming paradigm
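MPI, the de facto message-passing standard, makes this explicit exchange concrete. A minimal sketch (run with two processes, e.g., mpicc followed by mpirun -np 2):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                 /* data lives in rank 0's local memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* rank 1 cannot read rank 0's memory; it must receive a message */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }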
Message Passing
[Figure: nodes, each a CPU with its own memory, connected by a message-passing network]
- Message passing fits well with the design philosophy of clusters
- offers great flexibility in exploiting data locality, tolerating latency, and other performance-enhancement techniques
Message Passing
- Debugging:
  - often easier with message passing
  - accidental overwriting of memory is less likely and much easier to detect than with shared-memory paradigms
- Programming with message passing:
  - sometimes criticized as tedious and low-level
  - tends to result in programs with good performance, scalability, and portability
- Naturally well suited to distributed memory
- The dominant paradigm for scalable applications on massively parallel systems
Contents
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements
Measures of Parallel Performance
- T1: serial execution time on one processor
- Tp: parallel execution time on p processors
- Speedup: Sp = T1/Tp
- Efficiency: Ep = T1/(p*Tp)
- Thus, Ep = Sp/p and Sp = p*Ep
- Pseudotheorem: Sp ≤ p and Ep ≤ 1
- But "speedup anomalies" can occur if resources (e.g., cache) increase as p increases, so that the effective computation rate increases
Amdahl’s Law
- Assumptions:
  - serial fraction s, 0 ≤ s ≤ 1
  - parallel fraction 1 - s
- Conclusions:
  - Tp = s*T1 + (1-s)*T1/p
  - Sp = p/(s*p + (1-s))
  - Ep = 1/(s*p + (1-s))
- Corollary: Sp → 1/s and Ep → 0 as p → ∞
- Amdahl's Law induced early pessimism about the potential of parallel computing (see the worked example below)
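A quick worked example (the numbers are mine): even with a serial fraction of only s = 0.05, speedup saturates near 1/s = 20, no matter how many processors are used:

    #include <stdio.h>

    /* Amdahl's law: Sp = p / (s*p + (1 - s)) */
    double speedup(double s, double p) { return p / (s * p + (1.0 - s)); }

    int main(void) {
        double s = 0.05;                       /* 5% serial fraction */
        for (double p = 10; p <= 10000; p *= 10)
            printf("p = %6.0f -> Sp = %6.2f\n", p, speedup(s, p));
        /* prints Sp of about 6.9, 16.8, 19.6, 19.96: the limit is 1/s = 20 */
        return 0;
    }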
Problem Scaling
- Amdahl's Law applies only to fixed-size problems, or when the serial fraction is independent of the problem size (rarely true in practice!)
- Larger computers are intended for solving larger problems, and the serial fraction usually decreases with problem size
- An algorithm is scalable if efficiency can be maintained at a constant value (or at least bounded away from zero) as the number of processors grows, by increasing the problem size
Pitfalls of Asymptotic Analysis
- Asymptotic analysis is often based on an unrealistic model of parallel computation
- Asymptotic estimates apply for large n and p, but may not be relevant for the actual values of interest
- Lower-order terms may be significant for n and p of practical interest
  - Example: if the complexity is 10*n + n*log n, the linear term is actually larger for n < 1024
- Proportionality constants may make an important practical difference
  - Example: a complexity of 10*n² is actually better than a complexity of 1000*n*log n for n < 996
Parallel Performance Modeling
- Tp: time elapsed from the start of execution on the first processor until the end of execution on the last processor
- Tcomp: serial execution time plus time for any additional computation due to parallel execution
- Tcomm: time spent sending and receiving messages
- Tidle: time due to lack of work to do or lack of necessary data (e.g., waiting for a message)

  Tp = Tcomp + Tcomm + Tidle
Communication Costs
- Time for sending a message is modeled by

  Tmsg = ts + tw*L

  where ts is the startup time, tw is the transfer time per word, and L is the message length in words
- Minimum latency is ts (for a zero-length message)
- Bandwidth of the communication channel is 1/tw
- Typically, ts is roughly two orders of magnitude larger than tw on most real parallel systems
- The startup term usually dominates the cost for small messages, the bandwidth term for large messages (see the sketch below)
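A minimal sketch of this model (the ts and tw values are illustrative, not measurements): the effective bandwidth L/Tmsg approaches 1/tw only for large messages:

    #include <stdio.h>

    int main(void) {
        double ts = 1e-5;   /* startup time: 10 microseconds (illustrative) */
        double tw = 1e-8;   /* per-word transfer time: 1e8 words/s (illustrative) */
        for (long L = 1; L <= 10000000; L *= 100) {
            double tmsg = ts + tw * L;          /* Tmsg = ts + tw*L */
            printf("L = %8ld words: Tmsg = %.2e s, eff. bw = %.2e words/s\n",
                   L, tmsg, L / tmsg);
        }
        /* for small L the startup term ts dominates; for large L the
           effective bandwidth approaches 1/tw = 1e8 words/s */
        return 0;
    }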
Idle Time
- Idle time due to lack of work:
  - reduced by improving the load balance
- Idle time due to lack of data:
  - reduced by overlapping computation and communication
Summary
- General Remarks
- Interconnection Networks
- Classification of the Supercomputers
- Top 500 Highlights
- Parallel Programming Paradigms
- Performance Measurements