Page 1:

Parallel Programming and Algorithms: A Primer

Kishore Kothapalli, IIIT-H
kkishore@iiit.ac.in

Workshop on Multi-core Technologies
International Institute of Information Technology
July 23 – 25, 2009, Hyderabad.

Page 2:

GRAND CHALLENGE PROBLEMS

• Global change
• Human genome
• Fluid turbulence
• Vehicle dynamics
• Ocean circulation
• Viscous fluid dynamics
• Superconductor modeling
• Quantum chromodynamics
• Vision

Page 3:

APPLICATIONS

Nature of workloads: the computational and storage demands of technical, scientific, digital media, and business applications call for ever finer degrees of spatial and temporal resolution.

• A computational fluid dynamics (CFD) calculation on an airplane wing: a 512 x 64 x 256 grid, 5,000 fl-pt operations per grid point, 5,000 steps, i.e., 2.1 x 10^14 fl-pt operations; about 3.5 minutes on a machine sustaining 1 trillion fl-pt operations per second.
• A simulation of a full aircraft: 3.5 x 10^17 grid points and a total of 8.7 x 10^24 fl-pt operations; on the same machine this requires more than 275,000 years to complete.
• Simulation of magnetic materials at the level of 2,000-atom systems requires 2.64 Tflops of computational power and 512 GB of storage; a full hard-disk simulation needs 30 Tflops and 2 TB. Current investigations are limited to about 1,000 atoms (0.5 Tflops, 250 GB); future investigations involving 10,000 atoms will need 100 Tflops and 2.5 TB.
• Digital movies and special effects: 10^14 fl-pt operations per frame at 50 frames per second, so a 90-minute movie represents 2.7 x 10^19 fl-pt operations. It would take 2,000 1-Gflops CPUs approximately 150 days to complete the computation.
• Business workloads: inventory planning, risk analysis, workforce scheduling, and chip design.
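These estimates are simple back-of-envelope arithmetic; the following minimal C sketch (figures taken from the slide above, nothing else assumed) reproduces them:

```c
#include <stdio.h>

int main(void) {
    /* CFD on a wing: 512 x 64 x 256 grid, 5000 fl-pt ops
       per grid point, 5000 time steps. */
    double grid = 512.0 * 64.0 * 256.0;          /* ~8.4e6 points */
    double wing_ops = grid * 5000.0 * 5000.0;    /* ~2.1e14 ops   */
    printf("wing: %.2g ops, %.1f min at 1 Tflop/s\n",
           wing_ops, wing_ops / 1e12 / 60.0);    /* ~3.5 minutes  */

    /* 90-minute movie: 1e14 ops per frame, 50 frames/s. */
    double movie_ops = 1e14 * 50.0 * 90.0 * 60.0;   /* ~2.7e19 */
    printf("movie: %.2g ops, %.0f days on 2000 x 1-Gflops CPUs\n",
           movie_ops, movie_ops / 2e12 / 86400.0);  /* ~156 days */
    return 0;
}
```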

Page 4:

Conventional Wisdom (CW) in Computer Architecture (Patterson)

• Old CW: Power is free, transistors expensive.
• New CW: "Power wall". Power expensive, transistors free (can put more on a chip than one can afford to turn on).
• Old CW: Multiplies are slow, memory access is fast.
• New CW: "Memory wall". Memory slow, multiplies fast (200 clocks to DRAM memory, 4 clocks for an FP multiply).
• Old CW: Increase instruction-level parallelism via compilers and innovation (out-of-order execution, speculation, VLIW, ...).
• New CW: "ILP wall". Diminishing returns on more ILP.
• New CW: Power wall + memory wall + ILP wall = brick wall.
• Old CW: Uniprocessor performance 2x / 1.5 yrs.
• New CW: Uniprocessor performance only 2x / 5 yrs?

Page 5:

Page 6:

Multicore and Manycore Processors

• IBM Cell
• NVidia GeForce 8800 (includes 128 scalar processors) and Tesla
• Sun T1 and T2
• Tilera Tile64
• Picochip (combines 430 simple RISC cores)
• Cisco 188
• TRIPS

Page 7:

Parallel Programming?

Programming where concurrent executions are explicitly specified, possibly in a high-level language.

Stake-holders:
• Architects: understand workloads.
• Algorithm designers: focus on designs for real systems.
• Programmers: understand performance issues and engineer for better performance.

Page 8:

Parallel Programming: 4 Approaches

• Extend an existing compiler, e.g., a Fortran compiler.
• Extend an existing language with new constructs, e.g., MPI and OpenMP (see the sketch below).
• Add a parallel programming layer. Not popular.
• Design a new parallel language and build a compiler for it. Most difficult.
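As a minimal illustration of the second approach: in C with OpenMP, a single directive added to an ordinary loop requests parallel execution. This is a sketch, assuming an OpenMP-capable compiler (e.g., gcc with -fopenmp):

```c
#include <stdio.h>

#define N 8

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    /* One added construct makes the loop parallel; without
       OpenMP support the pragma is ignored and the loop
       simply runs sequentially. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * a[i];
    for (int i = 0; i < N; i++)
        printf("%g ", a[i]);
    printf("\n");
    return 0;
}
```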

Page 9:

Parallel Programming

How is it different from programming a uniprocessor?
• In the latter, the program is mostly fixed and mostly taken for granted.
• Other entities such as compilers and the operating system change, but one need not rewrite the source.

Page 10:

Parallel Programming

• Programs have to be written to suit the available architecture.
• A continuous evolutionary model, taking into account parallel software and architecture.

Some Challenges
• More processors
• Memory hierarchy
• Scope for several optimizations/trade-offs, e.g., communication.

Page 11:

Parallelization Process

• Assume that a description of the sequential program is available.
• Does the sequential program lend itself to direct parallelization? There are enough cases where it does and enough where it does not.
• We will see an example of both.

Page 12:

Parallelization Process

• Identify tasks that can be done in parallel.
• Goal: a high-performance implementation with reasonable effort and resources.
• Who should do it? Compiler, OS, run-time system, programmer. Different challenges in different approaches.

Page 13:

Parallelization Process – 4 Steps

1. Decomposition: computation to tasks
2. Assignment: task-to-process assignment
3. Orchestration: understand communication and synchronization
4. Mapping: map to physical processors

Page 14:

Parallelization Process – In Pictures

[Figure: the four stages in sequence, decomposition, assignment, orchestration, mapping, ending with processes placed on processors P1–P4.]

Page 15:

Decomposition

• Break the computation into a collection of tasks.
• Can have dynamic generation of tasks.
• Goal is to expose as much concurrency as possible, while being careful to keep the overhead manageable.

Page 16:

Decomposition

• Limitation: the available concurrency. Formalized as Amdahl's law.
• Let s be the fraction of operations in a computation that must be performed sequentially, with 0 ≤ s ≤ 1. The maximum speed-up ψ achievable by a parallel computer with p processors is:

ψ ≤ 1 / (s + (1 − s)/p)

Page 17:

Decomposition – Implications of Amdahl's Law

• Some processors may have to be idle due to the sequential nature of the program.
• Also applicable to other resources.
• Quick example: if 20% of the program is sequential, then the best speed-up with 10 processors is limited to 1/(0.2 + 0.08) ≈ 3.6.
• Amdahl's law: as p → ∞, the speed-up is bounded by 1/s.

Page 18:

Assignment

• Distribution of tasks among processes.
• Issue: balance the load among the processes. Load includes the number of tasks and inter-process communication.
• One has to be careful, because inter-process communication is expensive and load imbalance can affect performance.

Page 19:

Assignment: Static vs. Dynamic

Static assignment:
• Assignment completely specified at the beginning.
• Does not change after that.
• Useful for very structured applications.

Page 20:

Assignment: Static vs. Dynamic

Dynamic assignment:
• Assignment changes at runtime: imagine a task pool (see the sketch below).
• Has a chance to correct load imbalance.
• Useful for unstructured applications.
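A hedged sketch of the task-pool idea in C with OpenMP: schedule(dynamic) hands each loop iteration (task) to whichever thread becomes free next. The uneven work() function is hypothetical, present only to create imbalance:

```c
#include <stdio.h>

#define NTASKS 100

/* Hypothetical task with uneven cost across tasks. */
static double work(int task) {
    double x = 0.0;
    for (long i = 0; i < 100000L * (task % 7 + 1); i++)
        x += 1.0 / (double)(i + 1);
    return x;
}

int main(void) {
    double results[NTASKS];
    /* schedule(dynamic) acts as a task pool: each idle thread
       grabs the next unclaimed iteration, correcting load
       imbalance at runtime. */
    #pragma omp parallel for schedule(dynamic)
    for (int t = 0; t < NTASKS; t++)
        results[t] = work(t);

    double total = 0.0;
    for (int t = 0; t < NTASKS; t++)
        total += results[t];
    printf("total = %f\n", total);
    return 0;
}
```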

Page 21:

Orchestration

• Bring in the architecture, the programming model, and the programming language.
• Consider the available mechanisms for:
  - Data exchange
  - Synchronization
  - Inter-process communication
• Various programming model primitives and their relative merits.

Page 22:

Orchestration

• Data structures and their organization.
• Exploit temporal locality among tasks assigned to a process by proper scheduling.
• Implicit vs. explicit communication.
• Size of messages.

Page 23:

Orchestration – Goals

• Preserve data locality.
• Schedule tasks to remove inter-task waiting.
• Reduce the overhead of managing parallelism.

Page 24:

Mapping

• Closer and more specific to the system and the programming environment.
• User controlled: which process runs on which processor?
• Want an assignment that preserves locality of communication.

Page 25:

Mapping

• System controlled: the OS schedules processes on processors dynamically; processes may be migrated across processors.
• In-between approach: take user requests into account, but the system may change the placement.

Page 26:

Parallelization Process – Summary

Of the 4 stages, decomposition and assignment are independent of the architecture and the programming language/environment.

Step             | Architecture dependent? | Goals
1. Decomposition | Mostly no               | Expose enough concurrency
2. Assignment    | Mostly no               | Load balancing
3. Orchestration | Yes                     | Reduce IPC, inter-task dependence, synchronization
4. Mapping       | Yes                     | Exploit communication locality

Page 27:

Rest of the Lecture

• Concentrate on Steps 1 and 2: these are algorithmic in nature.
• Steps 3 and 4 are programming in nature, and mostly self-taught; a few inputs from my side.

Page 28:

Parallelization Process – In Pictures

[Figure repeated from Page 14: decomposition, assignment, orchestration, mapping onto processors P1–P4.]

Page 29:

A Similar View

Along similar lines, a methodology proposed by Ian Foster:
• Partitioning: akin to decomposition.
• Communication: understand the communication required by the partition.
• Agglomeration: combine tasks to reduce communication, preserve locality, and ease the programming effort.
• Mapping: map processes to processors.

See Parallel Programming in C with MPI and OpenMP, M. J. Quinn.

Page 30:

Foster's Design Methodology

[Figure: partitioning, then communication, then agglomeration, then mapping.]

Page 31:

Example 1 – Sequential to Parallel: Matrix Multiplication

Listing 1: Sequential Code
for i = 1 to n do
  for j = 1 to n do
    C[i][j] = 0
    for k = 1 to n do
      C[i][j] += A[i][k] * B[k][j]
    end
  end
end

Page 32:

Matrix Multiplication

• Easy to modify the sequential algorithm into a parallel algorithm.
• Several techniques available:
  - Recursive approach
  - Sub-matrices in parallel
  - Rows/columns in parallel (see the sketch below)
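A hedged sketch of the rows-in-parallel option in C with OpenMP, chosen here only as convenient notation (the slides do not prescribe a framework): the iterations of the outer i-loop are independent, so each thread can compute its own rows of C.

```c
#include <stdio.h>

#define N 4

/* Rows-in-parallel matrix multiplication: rows of C are
   computed concurrently since the i-iterations are independent. */
static void matmul(const double A[N][N], const double B[N][N],
                   double C[N][N]) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

int main(void) {
    double A[N][N], B[N][N], C[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = (i == j) ? 1.0 : 0.0;   /* identity matrix */
        }
    matmul(A, B, C);
    printf("C[1][2] = %g\n", C[1][2]);        /* expect 3 */
    return 0;
}
```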

Page 33:

Example 2 – New Parallel Algorithm

Prefix computations: given an array A of n elements and an associative operation ∘, compute A(1) ∘ A(2) ∘ ... ∘ A(i) for each i.

A very simple sequential algorithm exists for this problem.

Listing 2:
S(1) = A(1)
for i = 2 to n do
  S(i) = S(i-1) ∘ A(i)
Page 34:

Parallel Prefix Computation

• The sequential algorithm in Listing 2 is not efficient in parallel.
• Need a new algorithmic approach: the balanced binary tree.

Page 35:

Balanced Binary Tree

• An algorithm design approach for parallel algorithms.
• Many problems can be solved with this design technique.
• Easily amenable to parallelization and analysis.

Page 36:

Balanced Binary Tree

• A complete binary tree with a processor at each internal node.
• Input is at the leaf nodes.
• Define the operations to be executed at the internal nodes. The inputs to the operation at a node are the values at the children of that node.
• Computation proceeds as a tree traversal from the leaves to the root.

Page 37:

Balanced Binary Tree – Prefix Sums

[Figure: a complete binary tree with inputs a0 ... a7 at the leaves and a + operation at each internal node.]

Page 38:

Balanced Binary Tree – Sum

[Figure: the tree after the upward pass. The internal nodes hold a0 + a1, a2 + a3, a4 + a5, a6 + a7, then a0 + a1 + a2 + a3 and a4 + a5 + a6 + a7; the root holds Σ ai.]

Page 39:

Balanced Binary Tree – Sum

• The above approach is called an "upward traversal": data flows from the children to the root.
• Helpful in other situations too, such as computing the max or expression evaluation (see the sketch below).
• Analogously, one can define a downward traversal: data flows from the root to the leaves. Helps in settings such as broadcasting an element.
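As a hedged aside, the upward traversal is exactly what a reduction performs; in C with OpenMP the tree is implicit in the reduction clause. Computing a max this way (reduction(max:...) requires OpenMP 3.1 or later):

```c
#include <stdio.h>

int main(void) {
    double a[8] = {3, 1, 4, 1, 5, 9, 2, 6};
    double best = a[0];
    /* Per-thread partial maxima are combined pairwise, an
       implicit upward traversal of a balanced tree. */
    #pragma omp parallel for reduction(max:best)
    for (int i = 1; i < 8; i++)
        if (a[i] > best) best = a[i];
    printf("max = %g\n", best);   /* 9 */
    return 0;
}
```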

Page 40:

Balanced Binary Tree

• Can use a combination of both upward and downward traversals.
• Prefix computation requires exactly that.
• Illustration in the next slides.

Page 41:

Balanced Binary Tree – Sum

[Figure: upward pass over inputs a1 ... a8. The internal nodes hold a1 + a2, a3 + a4, a5 + a6, a7 + a8, then a1 + a2 + a3 + a4 and a5 + a6 + a7 + a8; the root holds Σ ai.]

Page 42:

Balanced Binary Tree – Prefix Sum

[Figure: the upward traversal, as on the previous slide; the partial sums it produces are reused by the downward pass.]

Page 43:

Balanced Binary Tree – Prefix Sum

[Figure: downward traversal, even indices. Each even position reads its prefix directly from the tree: s2 = a1 + a2, s4 = a1 + a2 + a3 + a4, s6 = Σ(i=1..6) ai, s8 = Σ(i=1..8) ai.]

Page 44:

Balanced Binary Tree – Prefix Sum

[Figure: downward traversal, odd indices. Each odd position combines a tree value with its own input: s1 = a1, s3 = (a1 + a2) + a3, s5 = (Σ(i=1..4) ai) + a5, s7 = (Σ(i=1..6) ai) + a7.]

Page 45:

Balanced Binary Tree – Prefix Sums

• Two traversals of a complete binary tree.
• The tree is only a visual aid: map processors to locations in the tree and perform the equivalent computations.
• Algorithm designed in the PRAM model.
• Works in logarithmic time, with an optimal number of operations.

// upward traversal
1. for i = 1 to n/2 do in parallel
     B(i) = A(2i-1) ∘ A(2i)
2. Recursively compute the prefix sums of B = (B(1), B(2), ..., B(n/2)) and store them in C = (C(1), C(2), ..., C(n/2))
// downward traversal
3. for i = 1 to n do in parallel
     i is even:       S(i) = C(i/2)
     i = 1:           S(1) = A(1)
     i is odd, i > 1: S(i) = C((i-1)/2) ∘ A(i)
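A hedged translation of this pseudocode into C, with OpenMP standing in for the PRAM's parallel loops and + as the associative operation; a sketch for illustration (0-indexed arrays, n assumed a power of two), not a tuned implementation:

```c
#include <stdio.h>
#include <stdlib.h>

/* Balanced-binary-tree prefix sums: pair up, recurse on the
   half-size array, then fill in the even/odd positions.
   0-indexed: s[i] = a[0] + ... + a[i]. */
static void prefix(const double *a, double *s, int n) {
    if (n == 1) { s[0] = a[0]; return; }
    double *b = malloc((n / 2) * sizeof *b);
    double *c = malloc((n / 2) * sizeof *c);

    /* upward traversal: combine adjacent pairs */
    #pragma omp parallel for
    for (int i = 0; i < n / 2; i++)
        b[i] = a[2 * i] + a[2 * i + 1];

    prefix(b, c, n / 2);              /* recurse on the pair sums */

    /* downward traversal: the three cases of step 3 */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (i % 2 == 1)   s[i] = c[i / 2];            /* position i+1 even */
        else if (i == 0)  s[i] = a[0];                /* first element     */
        else              s[i] = c[i / 2 - 1] + a[i]; /* position i+1 odd  */
    }
    free(b);
    free(c);
}

int main(void) {
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, s[8];
    prefix(a, s, 8);
    for (int i = 0; i < 8; i++)
        printf("%g ", s[i]);          /* 1 3 6 10 15 21 28 36 */
    printf("\n");
    return 0;
}
```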

Page 46:

The PRAM Model

An extension of the von Neumann model.

[Figure: processors P1, P2, P3, ..., Pn all connected to a global shared memory.]

Page 47:

The PRAM Model

• A set of n identical processors.
• A common-access shared memory.
• Synchronous time steps.
• An access to the shared memory costs the same as a unit of computation.
• Different models provide semantics for concurrent access to the shared memory: EREW, CREW, CRCW (Common, Arbitrary, Priority, ...).

Page 48:

PRAM Model – Advantages and Drawbacks

Advantages:
• A simple model for algorithm design.
• Hides architectural details from the designer.
• A good starting point.

Disadvantages:
• Ignores architectural features such as memory bandwidth, communication cost and latency, scheduling, ...
• Hardware may be difficult to realize.

Page 49:

Other Models: The Network Model

[Figure: processors P1 ... P7 forming the vertices of a graph, with communication links as the edges.]

• A graph G of processors.
• Send/receive messages over the edges.
• Computation through communication.
• Efficiency depends on the graph G.

Page 50:

The Network Model

There are a few disadvantages:
• The algorithm has to change if the network changes.
• Difficult to specify and design algorithms. (A message-passing sketch follows below.)
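For flavor, a hedged sketch of "computation through communication" using MPI, the practical message-passing counterpart referenced in Quinn's book: rank 0 sends a value along an edge of the processor graph to rank 1, which continues the computation.

```c
#include <stdio.h>
#include <mpi.h>

/* Two nodes of a (trivial) processor graph exchanging a message.
   Build with mpicc; run with: mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 41;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 computed %d\n", value + 1);   /* 42 */
    }
    MPI_Finalize();
    return 0;
}
```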

Page 51:

More Design Paradigms

• Divide and conquer: akin to the sequential design technique.
• Partitioning: a case of divide and conquer where the subproblems are independent of each other, so there is no need to combine solutions. Better suited for algorithms such as merging.
• Path doubling or pointer jumping: suitable where the data is in linked lists.

Page 52:

More Design Paradigms

• Accelerated cascading: a technique to combine two parallel algorithms into a better one.
  - Algorithm A could be very fast but do a lot of operations.
  - Algorithm B is slow but work-optimal.
  - Combine Algorithms A and B to get both advantages.

Page 53:

References

• Parallel Computer Architecture: A Hardware/Software Approach, D. Culler, J. P. Singh, and A. Gupta.
• Parallel Programming in C with MPI and OpenMP, M. J. Quinn.
• An Introduction to Parallel Algorithms, J. JaJa.

Page 54:

List Ranking – Another Example

• Process a linked list to answer the distance of each node from one end of the list.
• Linked lists are a fundamental data structure. (A pointer-jumping sketch follows below.)
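A hedged sketch of list ranking by pointer jumping, the technique named on the next slide: every node repeatedly adds its successor's partial rank and then jumps its pointer past that successor, so the covered distance doubles each round and the list is ranked in O(log n) rounds. The array-of-successors representation and the sample list are illustrative assumptions; the inner loops emulate the "in parallel" steps.

```c
#include <stdio.h>

#define N   8
#define NIL (-1)

int main(void) {
    /* next[i]: successor of node i; rank[i]: distance to the end. */
    int next[N] = {1, 2, 3, 4, 5, 6, 7, NIL};  /* list 0 -> 1 -> ... -> 7 */
    int rank[N], nn[N], nr[N];

    for (int i = 0; i < N; i++)
        rank[i] = (next[i] == NIL) ? 0 : 1;

    /* Pointer jumping: log2(8) = 3 rounds. The temporaries nn/nr
       emulate the synchronous parallel update of all nodes at once. */
    for (int round = 0; round < 3; round++) {
        for (int i = 0; i < N; i++) {
            if (next[i] != NIL) {
                nr[i] = rank[i] + rank[next[i]];
                nn[i] = next[next[i]];
            } else {
                nr[i] = rank[i];
                nn[i] = NIL;
            }
        }
        for (int i = 0; i < N; i++) {
            rank[i] = nr[i];
            next[i] = nn[i];
        }
    }
    for (int i = 0; i < N; i++)
        printf("node %d: rank %d\n", i, rank[i]);  /* 7 6 5 4 3 2 1 0 */
    return 0;
}
```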

Page 55:

List Ranking – Another Example

• Pointer jumping – 3
• Independent-set based – 3