Parallel Programming and Algorithms: A Primer

Kishore Kothapalli, IIIT-H
[email protected]

Workshop on Multi-core Technologies
International Institute of Information Technology
July 23 – 25, 2009, Hyderabad.
GRAND CHALLENGE PROBLEMS

• Global change
• Human genome
• Fluid turbulence
• Vehicle dynamics
• Ocean circulation
• Viscous fluid dynamics
• Superconductor modeling
• Quantum chromodynamics
• Vision
APPLICATIONS

Nature of workloads: the computational and storage demands of technical, scientific, digital media, and business applications call for ever finer degrees of spatial and temporal resolution.

• A computational fluid dynamics (CFD) calculation on an airplane wing: a 512 x 64 x 256 grid, 5000 fl-pt operations per grid point, 5000 time steps, giving about 2.1 x 10^14 fl-pt operations, or 3.5 minutes on a machine sustaining 1 trillion fl-pt operations per second. A simulation of a full aircraft needs 3.5 x 10^17 grid points and a total of 8.7 x 10^24 fl-pt operations; on the same machine it would require more than 275,000 years to complete.
• Simulation of magnetic materials at the level of 2000-atom systems requires 2.64 Tflops of computational power and 512 GB of storage. A full hard-disk simulation needs 30 Tflops and 2 TB. Current investigations are limited to about 1000 atoms (0.5 Tflops, 250 GB); future investigations involving 10,000 atoms will need 100 Tflops and 2.5 TB.
• Digital movies and special effects: 10^14 fl-pt operations per frame at 50 frames per second, so a 90-minute movie represents 2.7 x 10^19 fl-pt operations. It would take 2,000 1-Gflops CPUs approximately 150 days to complete the computation.
• Business applications: inventory planning, risk analysis, workforce scheduling, and chip design.
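The CFD figures above follow from straightforward arithmetic; a quick sanity check of the wing-grid numbers (a sketch, not part of the original slides):

```python
# Back-of-the-envelope check of the CFD workload figures quoted above.
grid_points = 512 * 64 * 256        # wing grid
ops_per_point = 5000                # fl-pt operations per grid point
steps = 5000                        # time steps
total_ops = grid_points * ops_per_point * steps

machine_flops = 1e12                # 1 trillion fl-pt ops/s sustained
seconds = total_ops / machine_flops

print(f"total ops = {total_ops:.2e}")        # ~2.1e14
print(f"time = {seconds / 60:.2f} minutes")  # ~3.5 minutes
```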
Conventional Wisdom (CW) in Computer Architecture – Patterson

• Old CW: Power is free, transistors expensive
• New CW: "Power wall": power expensive, transistors free (can put more on chip than can afford to turn on)
• Old CW: Multiplies are slow, memory access is fast
• New CW: "Memory wall": memory slow, multiplies fast (200 clocks to DRAM memory, 4 clocks for an FP multiply)
• Old CW: Increasing instruction-level parallelism via compilers and innovation (out-of-order, speculation, VLIW, ...)
• New CW: "ILP wall": diminishing returns on more ILP
• New: Power Wall + Memory Wall + ILP Wall = Brick Wall
  – Old CW: Uniprocessor performance 2x / 1.5 yrs
  – New CW: Uniprocessor performance only 2x / 5 yrs?
Multicore and Manycore Processors

• IBM Cell
• NVIDIA GeForce 8800 (includes 128 scalar processors) and Tesla
• Sun T1 and T2
• Tilera Tile64
• Picochip (combines 430 simple RISC cores)
• Cisco 188
• TRIPS
Parallel Programming?

Programming where concurrent executions are explicitly specified, possibly in a high-level language.
Stake-holders

• Architects: understand workloads.
• Algorithm designers: focus on designs for real systems.
• Programmers: understand performance issues and engineer for better performance.
Parallel Programming – 4 Approaches

• Extend an existing compiler, e.g. a parallelizing Fortran compiler.
• Extend an existing language with new constructs, e.g. MPI and OpenMP.
• Add a parallel programming layer. Not popular.
• Design a new parallel language and build a compiler. Most difficult.
Parallel Programming

How is it different from programming a uniprocessor? In the latter, the program is mostly fixed and mostly taken for granted. Other entities such as the compiler and the operating system change, but the source need not be rewritten.
Parallel Programming

Programs have to be written to suit the available architecture: a continuous evolutionary model taking into account both parallel software and architecture.
Some Challenges

• More processors
• Memory hierarchy
• Scope for several optimizations/trade-offs, e.g., communication.
Parallelization Process

Assume that a description of the sequential program is available. Does the sequential program lend itself to direct parallelization? There are enough cases where it does and where it does not; we will see an example of both.
Parallelization Process

• Identify tasks that can be done in parallel.
• Goal: a high-performance implementation with reasonable effort and resources.
• Who should do it? Compiler, OS, run-time system, programmer. Different challenges in different approaches.
Parallelization Process – 4 Steps

1. Decomposition: computation to tasks
2. Assignment: task-to-process assignment
3. Orchestration: understand communication and synchronization
4. Mapping: map to physical processors
Parallelization Process – In Pictures

[Figure: the four stages in sequence, Decomposition, Assignment, Orchestration, Mapping, ending with tasks placed on processors P1–P4.]
Decomposition

• Break the computation into a collection of tasks.
• Can have dynamic generation of tasks.
• Goal is to expose as much concurrency as possible, while keeping the overhead manageable.
Decomposition

Limitation: available concurrency, formalized as Amdahl's law. Let s be the fraction of operations in a computation that must be performed sequentially, with 0 ≤ s ≤ 1. The maximum speed-up ψ achievable by a parallel computer with p processors is:

ψ ≤ 1 / (s + (1 − s)/p)
Decomposition

Implications of Amdahl's law:
• Some processors may have to be idle due to the sequential nature of the program.
• Also applicable to other resources.
• Quick example: if 20% of the program is sequential, then the best speed-up with 10 processors is limited to 1/(0.2 + 0.8/10) = 1/0.28 ≈ 3.57.
• As p → ∞, the speed-up is bounded by 1/s.
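The bound is easy to play with in code; a minimal sketch of the formula above (the function name is mine):

```python
def amdahl_speedup(s, p):
    """Upper bound on speed-up: 1 / (s + (1 - s)/p),
    where s is the sequential fraction and p the processor count."""
    return 1.0 / (s + (1.0 - s) / p)

# The quick example above: 20% sequential, 10 processors.
print(round(amdahl_speedup(0.2, 10), 2))      # 3.57

# As p grows, the bound approaches 1/s = 5.
print(round(amdahl_speedup(0.2, 10**9), 2))   # 5.0
```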
Assignment

• Distribution of tasks among processes.
• Issue: balance the load among the processes. Load includes the number of tasks and inter-process communication.
• One has to be careful because inter-process communication is expensive, and load imbalance can affect performance.
Assignment: Static vs. Dynamic

Static assignment:
• Assignment completely specified at the beginning.
• Does not change after that.
• Useful for very structured applications.
Assignment: Static vs. Dynamic

Dynamic assignment:
• Assignment changes at runtime. Imagine a task pool.
• Has a chance to correct load imbalance.
• Useful for unstructured applications.
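The task-pool idea can be sketched with a shared work queue: idle workers pull the next task, so the load balances itself. A minimal sketch (thread-based; the worker count and tasks are illustrative):

```python
import queue
import threading

def run_task_pool(tasks, num_workers=4):
    """Dynamic assignment: workers repeatedly pull from a shared pool."""
    pool = queue.Queue()
    for t in tasks:
        pool.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = pool.get_nowait()   # grab the next available task
            except queue.Empty:
                return                  # pool drained: worker retires
            r = t()                     # execute the task
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Eight unequal tasks; whichever worker is free takes the next one.
out = run_task_pool([lambda i=i: i * i for i in range(8)])
print(sorted(out))   # [0, 1, 4, 9, 16, 25, 36, 49]
```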
Orchestration

Bring in the architecture, the programming model, and the programming language. Consider the available mechanisms for:
• Data exchange
• Synchronization
• Inter-process communication
• Various programming-model primitives and their relative merits
Orchestration

• Data structures and their organization.
• Exploit temporal locality among tasks assigned to a process by proper scheduling.
• Implicit vs. explicit communication; size of messages.
Orchestration – Goals

• Preserve data locality.
• Schedule tasks to remove inter-task waiting.
• Reduce the overhead of managing parallelism.
Mapping

• Closer and specific to the system and the programming environment.
• User controlled: which process runs on which processor? Want an assignment that preserves locality of communication.
Mapping

• System controlled: the OS schedules processes on processors dynamically; processes may be migrated across processors.
• In-between approach: take user requests into account, but the system may change them.
Parallelization Process – Summary

Of the 4 stages, decomposition and assignment are largely independent of the architecture and the programming language/environment.

Step              | Architecture dependent | Goals
1. Decomposition  | Mostly no              | Expose enough concurrency
2. Assignment     | Mostly no              | Load balancing
3. Orchestration  | Yes                    | Reduce IPC, inter-task dependence, synchronization
4. Mapping        | Yes                    | Exploit communication locality
Rest of the Lecture

• Concentrate on Steps 1 and 2: these are algorithmic in nature.
• Steps 3 and 4 are programming in nature and mostly self-taught. A few inputs from my side.
A Similar View

Along similar lines, proposed by Ian Foster:
• Partitioning: akin to decomposition.
• Communication: understand the communication required by the partition.
• Agglomeration: combine tasks to reduce communication, preserve locality, and ease the programming effort.
• Mapping: map processes to processors.

See Parallel Programming in C with MPI and OpenMP, M. J. Quinn.
Foster's Design Methodology

[Figure: Foster's four stages in sequence: Partitioning, Communication, Agglomeration, Mapping.]
Example 1 – Sequential to Parallel

Matrix Multiplication

Listing 1: Sequential code
for i = 1 to n do
  for j = 1 to n do
    C[i][j] = 0
    for k = 1 to n do
      C[i][j] += A[i][k] * B[k][j]
    end
  end
end
Matrix Multiplication

• Easy to modify the sequential algorithm into a parallel algorithm.
• Several techniques available:
  – Recursive approach
  – Sub-matrices in parallel
  – Rows/columns in parallel
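The rows-in-parallel technique follows directly from Listing 1: each row of C depends only on one row of A and all of B, so the rows are independent tasks. A sketch (thread-based for brevity; CPython threads illustrate the decomposition, but real speed-ups need processes or native code):

```python
from concurrent.futures import ThreadPoolExecutor

def row_times_matrix(row, B):
    """One task: multiply a single row of A with all of B."""
    cols = len(B[0])
    return [sum(row[k] * B[k][j] for k in range(len(B))) for j in range(cols)]

def parallel_matmul(A, B):
    """Rows-in-parallel decomposition: one independent task per row of C."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda row: row_times_matrix(row, B), A))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(parallel_matmul(A, B))   # [[19, 22], [43, 50]]
```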
Example 2 – New Parallel Algorithm

Prefix Computations: given an array A of n elements and an associative operation o, compute A(1) o A(2) o ... o A(i) for each i.

A very simple sequential algorithm exists for this problem.

Listing 2: Sequential prefix
S(1) = A(1)
for i = 2 to n do
  S(i) = S(i-1) o A(i)
end
Parallel Prefix Computation

• The sequential prefix algorithm above is not efficient in parallel.
• Need a new algorithmic approach: the balanced binary tree.
Balanced Binary Tree

• An algorithm design approach for parallel algorithms.
• Many problems can be solved with this design technique.
• Easily amenable to parallelization and analysis.
Balanced Binary Tree

• A complete binary tree with a processor at each internal node.
• Input is at the leaf nodes.
• Define operations to be executed at the internal nodes; the input for the operation at a node is the pair of values at its children.
• Computation proceeds as a tree traversal from the leaves to the root.
Balanced Binary Tree – Sum

[Figure: a complete binary tree over leaves a0 ... a7. Each internal node applies + to its children: the first level computes a0 + a1, a2 + a3, a4 + a5, a6 + a7; the next level computes a0 + a1 + a2 + a3 and a4 + a5 + a6 + a7; the root holds Σ ai.]
Balanced Binary Tree – Sum

• The above approach is called an "upward traversal": data flows from the children to the root.
• Helpful in other situations as well, such as computing the max or expression evaluation.
• Analogously, one can define a downward traversal: data flows from the root to the leaves. This helps in settings such as broadcasting an element.
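An upward traversal can be simulated level by level: each round halves the number of values, so n inputs take about log2 n rounds. A sketch, where op stands for the + of the figure or for max:

```python
def upward_traversal(vals, op):
    """Level-by-level tree reduction; each round combines adjacent pairs."""
    vals = list(vals)
    while len(vals) > 1:
        # All pairs at one level combine "in parallel" (one round).
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:              # odd element carries up unchanged
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

print(upward_traversal([1, 2, 3, 4, 5, 6, 7, 8], lambda x, y: x + y))  # 36
print(upward_traversal([3, 1, 4, 1, 5, 9, 2, 6], max))                 # 9
```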
Balanced Binary Tree

• Can use a combination of both upward and downward traversals.
• Prefix computation requires that; illustration in the next slides.
Balanced Binary Tree – Prefix Sum

[Figure: upward traversal over leaves a1 ... a8. Internal nodes compute the pairwise sums a1 + a2, a3 + a4, a5 + a6, a7 + a8, then a1 + a2 + a3 + a4 and a5 + a6 + a7 + a8, with Σ ai at the root.]

[Figure: downward traversal, even indices. Positions 2, 4, 6, 8 read their prefix sums directly from the upward traversal: a1 + a2, a1 + ... + a4, a1 + ... + a6, a1 + ... + a8.]

[Figure: downward traversal, odd indices. Positions 1, 3, 5, 7 combine a prefix from the level above with their own leaf: a1, (a1 + a2) + a3, (a1 + ... + a4) + a5, (a1 + ... + a6) + a7.]
Balanced Binary Tree – Prefix Sums

• Two traversals of a complete binary tree.
• The tree is only a visual aid: map processors to locations in the tree and perform the equivalent computations.
• Algorithm designed in the PRAM model.
• Works in logarithmic time with an optimal number of operations.
// upward traversal
1. for i = 1 to n/2 do in parallel
     b_i = a_{2i-1} o a_{2i}
2. Recursively compute the prefix sums of B = (b_1, b_2, ..., b_{n/2}) and store them in C = (c_1, c_2, ..., c_{n/2})
// downward traversal
3. for i = 1 to n do in parallel
     i is even       : s_i = c_{i/2}
     i = 1           : s_1 = a_1
     i is odd, i > 1 : s_i = c_{(i-1)/2} o a_i
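The three steps translate directly into code. A sketch assuming n is a power of two (the recursion simulates the parallel loops sequentially; the function name is mine):

```python
def parallel_prefix(a, op=lambda x, y: x + y):
    """Prefix computation by the balanced-binary-tree scheme.

    Assumes len(a) is a power of two. Steps mirror the pseudocode:
    pair up, recurse on the half-sized array, then fill in results.
    """
    n = len(a)
    if n == 1:
        return list(a)
    # Step 1 (upward): combine adjacent pairs, all "in parallel".
    b = [op(a[2 * i], a[2 * i + 1]) for i in range(n // 2)]
    # Step 2: recursively compute prefix sums of the half-sized array.
    c = parallel_prefix(b, op)
    # Step 3 (downward): even positions copy, odd positions combine.
    s = [None] * n
    for j in range(n):           # 0-indexed; position j holds element j+1
        if j == 0:
            s[j] = a[0]                     # i = 1 : s_1 = a_1
        elif j % 2 == 1:
            s[j] = c[(j - 1) // 2]          # i even : s_i = c_{i/2}
        else:
            s[j] = op(c[j // 2 - 1], a[j])  # i odd  : s_i = c_{(i-1)/2} o a_i
    return s

print(parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]
```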
The PRAM Model

An extension of the von Neumann model.

[Figure: processors P1, P2, P3, ..., Pn all connected to a global shared memory.]
The PRAM Model

• A set of n identical processors.
• A commonly accessible shared memory.
• Synchronous time steps.
• An access to the shared memory costs the same as a unit of computation.
• Different models provide semantics for concurrent access to the shared memory: EREW, CREW, CRCW (Common, Arbitrary, Priority, ...).
PRAM Model – Advantages and Drawbacks

Advantages:
• A simple model for algorithm design.
• Hides architectural details from the designer.
• A good starting point.

Disadvantages:
• Ignores architectural features such as memory bandwidth, communication cost and latency, scheduling, ...
• The hardware may be difficult to realize.
Other Models

The Network Model

[Figure: a graph G of processors P1 ... P7 connected by edges.]

• A graph G of processors.
• Send/receive messages over the edges.
• Computation through communication.
• Efficiency depends on the graph G.
The Network Model

There are a few disadvantages:
• The algorithm has to change if the network changes.
• It is more difficult to specify and design algorithms.
More Design Paradigms

• Divide and conquer: akin to the sequential design technique.
• Partitioning: a case of divide and conquer where the subproblems are independent of each other, so there is no need to combine solutions. Better suited for algorithms such as merging.
• Path doubling, or pointer jumping: suitable where the data is in linked lists.
More Design Paradigms

• Accelerated cascading: a technique to combine two parallel algorithms to get a better algorithm.
  – Algorithm A is very fast but does a lot of operations.
  – Algorithm B is slow but work-optimal.
  – Combine the two to get both advantages: run the work-optimal algorithm until the problem is small enough, then switch to the fast one.
References

• Parallel Computer Architecture: A Hardware/Software Approach, Culler, Singh, and Gupta.
• Parallel Programming in C with MPI and OpenMP, M. J. Quinn.
• An Introduction to Parallel Algorithms, J. JaJa.
List Ranking – Another Example

• Process a linked list to answer the distance of each node from one end of the list.
• Linked lists are a fundamental data structure.
List Ranking – Another Example

• Pointer-jumping based approach
• Independent-set based approach
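Pointer jumping gives a simple O(log n)-round list-ranking algorithm: in every round each node adds its successor's rank to its own and then jumps its pointer to its successor's successor, halving the distance to the tail. A sketch with an array-based list, where the tail points to itself (the rounds are "parallel" PRAM steps, simulated here):

```python
def list_rank(succ):
    """Rank each node (distance to the tail) by pointer jumping.

    succ[i] is the index of node i's successor; the tail t satisfies
    succ[t] == t. Runs ceil(log2 n) rounds.
    """
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    ptr = list(succ)
    for _ in range(max(1, (n - 1).bit_length())):
        # All nodes update simultaneously: read the old arrays,
        # write fresh ones (models the synchronous PRAM step).
        rank = [rank[i] + rank[ptr[i]] for i in range(n)]
        ptr = [ptr[ptr[i]] for i in range(n)]
    return rank

# List stored in order: node 0 -> 1 -> 2 -> 3 (tail).
print(list_rank([1, 2, 3, 3]))   # [3, 2, 1, 0]
```

Note that the rank update reads the old `ptr` before it is shortened; doing the two updates in the other order would lose the counts being skipped over.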