Parallel Programming and Algorithms: A Primer

Kishore Kothapalli, IIIT-H
[email protected]

Workshop on Multi-core Technologies
International Institute of Information Technology
July 23 – 25, 2009, Hyderabad.
GRAND CHALLENGE PROBLEMS

• Global change
• Human genome
• Fluid turbulence
• Vehicle dynamics
• Ocean circulation
• Viscous fluid dynamics
• Superconductor modeling
• Quantum chromodynamics
• Vision
APPLICATIONS

Nature of workloads: the computational and storage demands of technical, scientific, digital media, and business applications call for ever finer degrees of spatial and temporal resolution.

• A computational fluid dynamics (CFD) calculation on an airplane wing: a 512 x 64 x 256 grid, 5000 fl-pt operations per grid point, 5000 time steps, giving about 2.1 x 10^14 fl-pt operations, or 3.5 minutes on a machine sustaining 1 trillion fl-pt operations per second. A simulation of a full aircraft needs 3.5 x 10^17 grid points and a total of 8.7 x 10^24 fl-pt operations; on the same machine it would require more than 275,000 years to complete.
• Simulation of magnetic materials at the level of 2000-atom systems requires 2.64 Tflops of computational power and 512 GB of storage. A full hard-disk simulation needs 30 Tflops and 2 TB. Current investigations are limited to about 1000 atoms (0.5 Tflops, 250 GB); future investigations involving 10,000 atoms will need 100 Tflops and 2.5 TB.
• Digital movies and special effects: 10^14 fl-pt operations per frame at 50 frames per second, so a 90-minute movie represents 2.7 x 10^19 fl-pt operations. It would take 2,000 1-Gflops CPUs approximately 150 days to complete the computation.
• Business applications: inventory planning, risk analysis, workforce scheduling, and chip design.
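The CFD figures above follow from straightforward arithmetic; a quick sanity check of the wing-grid numbers (a sketch, not part of the original slides):

```python
# Back-of-the-envelope check of the CFD workload figures quoted above.
grid_points = 512 * 64 * 256        # wing grid
ops_per_point = 5000                # fl-pt operations per grid point
steps = 5000                        # time steps
total_ops = grid_points * ops_per_point * steps

machine_flops = 1e12                # 1 trillion fl-pt ops/s sustained
seconds = total_ops / machine_flops

print(f"total ops = {total_ops:.2e}")        # ~2.1e14
print(f"time = {seconds / 60:.2f} minutes")  # ~3.5 minutes
```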
Conventional Wisdom (CW) in Computer Architecture – Patterson

• Old CW: Power is free, transistors expensive
• New CW: "Power wall": power expensive, transistors free (can put more on chip than can afford to turn on)
• Old CW: Multiplies are slow, memory access is fast
• New CW: "Memory wall": memory slow, multiplies fast (200 clocks to DRAM memory, 4 clocks for an FP multiply)
• Old CW: Increasing instruction-level parallelism via compilers and innovation (out-of-order, speculation, VLIW, ...)
• New CW: "ILP wall": diminishing returns on more ILP
• New: Power Wall + Memory Wall + ILP Wall = Brick Wall
  – Old CW: Uniprocessor performance 2x / 1.5 yrs
  – New CW: Uniprocessor performance only 2x / 5 yrs?
Multicore and Manycore Processors

• IBM Cell
• NVIDIA GeForce 8800 (includes 128 scalar processors) and Tesla
• Sun T1 and T2
• Tilera Tile64
• Picochip (combines 430 simple RISC cores)
• Cisco 188
• TRIPS
Parallel Programming?

Programming where concurrent executions are explicitly specified, possibly in a high-level language.
Stake-holders

• Architects: understand workloads.
• Algorithm designers: focus on designs for real systems.
• Programmers: understand performance issues and engineer for better performance.
Parallel Programming – 4 Approaches

• Extend an existing compiler, e.g. a parallelizing Fortran compiler.
• Extend an existing language with new constructs, e.g. MPI and OpenMP.
• Add a parallel programming layer. Not popular.
• Design a new parallel language and build a compiler. Most difficult.
Parallel Programming

How is it different from programming a uniprocessor? In the latter, the program is mostly fixed and mostly taken for granted. Other entities such as the compiler and the operating system change, but the source need not be rewritten.
Parallel Programming

Programs have to be written to suit the available architecture: a continuous evolutionary model taking into account both parallel software and architecture.
Some Challenges

• More processors
• Memory hierarchy
• Scope for several optimizations/trade-offs, e.g., communication.
Parallelization Process

Assume that a description of the sequential program is available. Does the sequential program lend itself to direct parallelization? There are enough cases where it does and where it does not; we will see an example of both.
Parallelization Process

• Identify tasks that can be done in parallel.
• Goal: a high-performance implementation with reasonable effort and resources.
• Who should do it? Compiler, OS, run-time system, programmer. Different challenges in different approaches.
Parallelization Process – 4 Steps

1. Decomposition: computation to tasks
2. Assignment: task-to-process assignment
3. Orchestration: understand communication and synchronization
4. Mapping: map to physical processors
Parallelization Process – In Pictures

[Figure: the four stages in sequence, Decomposition, Assignment, Orchestration, Mapping, ending with tasks placed on processors P1–P4.]
Decomposition

• Break the computation into a collection of tasks.
• Can have dynamic generation of tasks.
• Goal is to expose as much concurrency as possible, while keeping the overhead manageable.
Decomposition

Limitation: available concurrency, formalized as Amdahl's law. Let s be the fraction of operations in a computation that must be performed sequentially, with 0 ≤ s ≤ 1. The maximum speed-up ψ achievable by a parallel computer with p processors is:

ψ ≤ 1 / (s + (1 − s)/p)
Decomposition

Implications of Amdahl's law:
• Some processors may have to be idle due to the sequential nature of the program.
• Also applicable to other resources.
• Quick example: if 20% of the program is sequential, then the best speed-up with 10 processors is limited to 1/(0.2 + 0.8/10) = 1/0.28 ≈ 3.57.
• As p → ∞, the speed-up is bounded by 1/s.
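The bound is easy to play with in code; a minimal sketch of the formula above (the function name is mine):

```python
def amdahl_speedup(s, p):
    """Upper bound on speed-up: 1 / (s + (1 - s)/p),
    where s is the sequential fraction and p the processor count."""
    return 1.0 / (s + (1.0 - s) / p)

# The quick example above: 20% sequential, 10 processors.
print(round(amdahl_speedup(0.2, 10), 2))      # 3.57

# As p grows, the bound approaches 1/s = 5.
print(round(amdahl_speedup(0.2, 10**9), 2))   # 5.0
```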
Assignment

• Distribution of tasks among processes.
• Issue: balance the load among the processes. Load includes the number of tasks and inter-process communication.
• One has to be careful because inter-process communication is expensive, and load imbalance can affect performance.
Assignment: Static vs. Dynamic

Static assignment:
• Assignment completely specified at the beginning.
• Does not change after that.
• Useful for very structured applications.
Assignment: Static vs. Dynamic

Dynamic assignment:
• Assignment changes at runtime. Imagine a task pool.
• Has a chance to correct load imbalance.
• Useful for unstructured applications.
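The task-pool idea can be sketched with a shared work queue: idle workers pull the next task, so the load balances itself. A minimal sketch (thread-based; the worker count and tasks are illustrative):

```python
import queue
import threading

def run_task_pool(tasks, num_workers=4):
    """Dynamic assignment: workers repeatedly pull from a shared pool."""
    pool = queue.Queue()
    for t in tasks:
        pool.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = pool.get_nowait()   # grab the next available task
            except queue.Empty:
                return                  # pool drained: worker retires
            r = t()                     # execute the task
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Eight unequal tasks; whichever worker is free takes the next one.
out = run_task_pool([lambda i=i: i * i for i in range(8)])
print(sorted(out))   # [0, 1, 4, 9, 16, 25, 36, 49]
```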
Orchestration

Bring in the architecture, the programming model, and the programming language. Consider the available mechanisms for:
• Data exchange
• Synchronization
• Inter-process communication
• Various programming-model primitives and their relative merits
Orchestration

• Data structures and their organization.
• Exploit temporal locality among tasks assigned to a process by proper scheduling.
• Implicit vs. explicit communication; size of messages.
Orchestration – Goals

• Preserve data locality.
• Schedule tasks to remove inter-task waiting.
• Reduce the overhead of managing parallelism.
Mapping

• Closer and specific to the system and the programming environment.
• User controlled: which process runs on which processor? Want an assignment that preserves locality of communication.
Mapping

• System controlled: the OS schedules processes on processors dynamically; processes may be migrated across processors.
• In-between approach: take user requests into account, but the system may change them.
Parallelization Process – Summary

Of the 4 stages, decomposition and assignment are largely independent of the architecture and the programming language/environment.

Step              | Architecture dependent | Goals
1. Decomposition  | Mostly no              | Expose enough concurrency
2. Assignment     | Mostly no              | Load balancing
3. Orchestration  | Yes                    | Reduce IPC, inter-task dependence, synchronization
4. Mapping        | Yes                    | Exploit communication locality
Rest of the Lecture

• Concentrate on Steps 1 and 2: these are algorithmic in nature.
• Steps 3 and 4 are programming in nature and mostly self-taught. A few inputs from my side.
A Similar View

Along similar lines, proposed by Ian Foster:
• Partitioning: akin to decomposition.
• Communication: understand the communication required by the partition.
• Agglomeration: combine tasks to reduce communication, preserve locality, and ease the programming effort.
• Mapping: map processes to processors.

See Parallel Programming in C with MPI and OpenMP, M. J. Quinn.
Foster's Design Methodology

[Figure: Foster's four stages in sequence: Partitioning, Communication, Agglomeration, Mapping.]
Example 1 – Sequential to Parallel

Matrix Multiplication

Listing 1: Sequential code
for i = 1 to n do
  for j = 1 to n do
    C[i][j] = 0
    for k = 1 to n do
      C[i][j] += A[i][k] * B[k][j]
    end
  end
end
Matrix Multiplication

• Easy to modify the sequential algorithm into a parallel algorithm.
• Several techniques available:
  – Recursive approach
  – Sub-matrices in parallel
  – Rows/columns in parallel
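The rows-in-parallel technique follows directly from Listing 1: each row of C depends only on one row of A and all of B, so the rows are independent tasks. A sketch (thread-based for brevity; CPython threads illustrate the decomposition, but real speed-ups need processes or native code):

```python
from concurrent.futures import ThreadPoolExecutor

def row_times_matrix(row, B):
    """One task: multiply a single row of A with all of B."""
    cols = len(B[0])
    return [sum(row[k] * B[k][j] for k in range(len(B))) for j in range(cols)]

def parallel_matmul(A, B):
    """Rows-in-parallel decomposition: one independent task per row of C."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda row: row_times_matrix(row, B), A))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(parallel_matmul(A, B))   # [[19, 22], [43, 50]]
```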
Example 2 – New Parallel Algorithm

Prefix Computations: given an array A of n elements and an associative operation o, compute A(1) o A(2) o ... o A(i) for each i.

A very simple sequential algorithm exists for this problem.

Listing 2: Sequential prefix
S(1) = A(1)
for i = 2 to n do
  S(i) = S(i-1) o A(i)
end
Parallel Prefix Computation

• The sequential prefix algorithm above is not efficient in parallel.
• Need a new algorithmic approach: the balanced binary tree.
Balanced Binary Tree

• An algorithm design approach for parallel algorithms.
• Many problems can be solved with this design technique.
• Easily amenable to parallelization and analysis.
Balanced Binary Tree

• A complete binary tree with a processor at each internal node.
• Input is at the leaf nodes.
• Define operations to be executed at the internal nodes; the input for the operation at a node is the pair of values at its children.
• Computation proceeds as a tree traversal from the leaves to the root.
Balanced Binary Tree – Sum

[Figure: a complete binary tree over leaves a0 ... a7. Each internal node applies + to its children: the first level computes a0 + a1, a2 + a3, a4 + a5, a6 + a7; the next level computes a0 + a1 + a2 + a3 and a4 + a5 + a6 + a7; the root holds Σ ai.]
Balanced Binary Tree – Sum

• The above approach is called an "upward traversal": data flows from the children to the root.
• Helpful in other situations as well, such as computing the max or expression evaluation.
• Analogously, one can define a downward traversal: data flows from the root to the leaves. This helps in settings such as broadcasting an element.
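An upward traversal can be simulated level by level: each round halves the number of values, so n inputs take about log2 n rounds. A sketch, where op stands for the + of the figure or for max:

```python
def upward_traversal(vals, op):
    """Level-by-level tree reduction; each round combines adjacent pairs."""
    vals = list(vals)
    while len(vals) > 1:
        # All pairs at one level combine "in parallel" (one round).
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:              # odd element carries up unchanged
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

print(upward_traversal([1, 2, 3, 4, 5, 6, 7, 8], lambda x, y: x + y))  # 36
print(upward_traversal([3, 1, 4, 1, 5, 9, 2, 6], max))                 # 9
```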
Balanced Binary Tree

• Can use a combination of both upward and downward traversals.
• Prefix computation requires that; illustration in the next slides.
Balanced Binary Tree – Prefix Sum

[Figure: upward traversal over leaves a1 ... a8. Internal nodes compute the pairwise sums a1 + a2, a3 + a4, a5 + a6, a7 + a8, then a1 + a2 + a3 + a4 and a5 + a6 + a7 + a8, with Σ ai at the root.]

[Figure: downward traversal, even indices. Positions 2, 4, 6, 8 read their prefix sums directly from the upward traversal: a1 + a2, a1 + ... + a4, a1 + ... + a6, a1 + ... + a8.]

[Figure: downward traversal, odd indices. Positions 1, 3, 5, 7 combine a prefix from the level above with their own leaf: a1, (a1 + a2) + a3, (a1 + ... + a4) + a5, (a1 + ... + a6) + a7.]
Balanced Binary Tree – Prefix Sums

• Two traversals of a complete binary tree.
• The tree is only a visual aid: map processors to locations in the tree and perform the equivalent computations.
• Algorithm designed in the PRAM model.
• Works in logarithmic time with an optimal number of operations.
// upward traversal
1. for i = 1 to n/2 do in parallel
     b_i = a_{2i-1} o a_{2i}
2. Recursively compute the prefix sums of B = (b_1, b_2, ..., b_{n/2}) and store them in C = (c_1, c_2, ..., c_{n/2})
// downward traversal
3. for i = 1 to n do in parallel
     i is even       : s_i = c_{i/2}
     i = 1           : s_1 = a_1
     i is odd, i > 1 : s_i = c_{(i-1)/2} o a_i
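The three steps translate directly into code. A sketch assuming n is a power of two (the recursion simulates the parallel loops sequentially; the function name is mine):

```python
def parallel_prefix(a, op=lambda x, y: x + y):
    """Prefix computation by the balanced-binary-tree scheme.

    Assumes len(a) is a power of two. Steps mirror the pseudocode:
    pair up, recurse on the half-sized array, then fill in results.
    """
    n = len(a)
    if n == 1:
        return list(a)
    # Step 1 (upward): combine adjacent pairs, all "in parallel".
    b = [op(a[2 * i], a[2 * i + 1]) for i in range(n // 2)]
    # Step 2: recursively compute prefix sums of the half-sized array.
    c = parallel_prefix(b, op)
    # Step 3 (downward): even positions copy, odd positions combine.
    s = [None] * n
    for j in range(n):           # 0-indexed; position j holds element j+1
        if j == 0:
            s[j] = a[0]                     # i = 1 : s_1 = a_1
        elif j % 2 == 1:
            s[j] = c[(j - 1) // 2]          # i even : s_i = c_{i/2}
        else:
            s[j] = op(c[j // 2 - 1], a[j])  # i odd  : s_i = c_{(i-1)/2} o a_i
    return s

print(parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]
```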
The PRAM Model

An extension of the von Neumann model.

[Figure: processors P1, P2, P3, ..., Pn all connected to a global shared memory.]
The PRAM Model

• A set of n identical processors.
• A commonly accessible shared memory.
• Synchronous time steps.
• An access to the shared memory costs the same as a unit of computation.
• Different models provide semantics for concurrent access to the shared memory: EREW, CREW, CRCW (Common, Arbitrary, Priority, ...).
PRAM Model – Advantages and Drawbacks

Advantages:
• A simple model for algorithm design.
• Hides architectural details from the designer.
• A good starting point.

Disadvantages:
• Ignores architectural features such as memory bandwidth, communication cost and latency, scheduling, ...
• The hardware may be difficult to realize.
Other Models

The Network Model

[Figure: a graph G of processors P1 ... P7 connected by edges.]

• A graph G of processors.
• Send/receive messages over the edges.
• Computation through communication.
• Efficiency depends on the graph G.
The Network Model

There are a few disadvantages:
• The algorithm has to change if the network changes.
• It is more difficult to specify and design algorithms.
More Design Paradigms

• Divide and conquer: akin to the sequential design technique.
• Partitioning: a case of divide and conquer where the subproblems are independent of each other, so there is no need to combine solutions. Better suited for algorithms such as merging.
• Path doubling, or pointer jumping: suitable where the data is in linked lists.
More Design Paradigms

• Accelerated cascading: a technique to combine two parallel algorithms to get a better algorithm.
  – Algorithm A is very fast but does a lot of operations.
  – Algorithm B is slow but work-optimal.
  – Combine the two to get both advantages: run the work-optimal algorithm until the problem is small enough, then switch to the fast one.
References

• Parallel Computer Architecture: A Hardware/Software Approach, Culler, Singh, and Gupta.
• Parallel Programming in C with MPI and OpenMP, M. J. Quinn.
• An Introduction to Parallel Algorithms, J. JaJa.
List Ranking – Another Example

• Process a linked list to answer the distance of each node from one end of the list.
• Linked lists are a fundamental data structure.
List Ranking – Another Example

• Pointer-jumping based approach
• Independent-set based approach
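Pointer jumping gives a simple O(log n)-round list-ranking algorithm: in every round each node adds its successor's rank to its own and then jumps its pointer to its successor's successor, halving the distance to the tail. A sketch with an array-based list, where the tail points to itself (the rounds are "parallel" PRAM steps, simulated here):

```python
def list_rank(succ):
    """Rank each node (distance to the tail) by pointer jumping.

    succ[i] is the index of node i's successor; the tail t satisfies
    succ[t] == t. Runs ceil(log2 n) rounds.
    """
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    ptr = list(succ)
    for _ in range(max(1, (n - 1).bit_length())):
        # All nodes update simultaneously: read the old arrays,
        # write fresh ones (models the synchronous PRAM step).
        rank = [rank[i] + rank[ptr[i]] for i in range(n)]
        ptr = [ptr[ptr[i]] for i in range(n)]
    return rank

# List stored in order: node 0 -> 1 -> 2 -> 3 (tail).
print(list_rank([1, 2, 3, 3]))   # [3, 2, 1, 0]
```

Note that the rank update reads the old `ptr` before it is shortened; doing the two updates in the other order would lose the counts being skipped over.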