Page 1: Day 2

Day 2

Page 2: Day 2

Agenda

• Parallelism basics

• Parallel machines

• Parallelism again

• High Throughput Computing
– Finding the right grain size

Page 3: Day 2

One thing to remember

Easy Hard

Page 4: Day 2

Seeking Concurrency

• Data dependence graphs

• Data parallelism

• Functional parallelism

• Pipelining

Page 5: Day 2

Data Dependence Graph

• Directed graph

• Vertices = tasks

• Edges = dependences

Page 6: Day 2

Data Parallelism

• Independent tasks apply same operation to different elements of a data set

• Okay to perform operations concurrently

for i ← 0 to 99 do
    a[i] ← b[i] + c[i]
endfor
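No iteration of this loop depends on any other, so the work can be split across processors. A minimal C sketch, using OpenMP as one (assumed) way to express it; the array contents are illustrative:

#include <stdio.h>

#define N 100

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

    /* Each iteration is independent, so the iterations can be
       divided among threads; OpenMP is one low-level way to say so. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[99] = %g\n", a[99]);   /* 99 + 198 = 297 */
    return 0;
}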

Page 7: Day 2

Functional Parallelism

• Independent tasks apply different operations to different data elements

• First and second statements can execute concurrently
• Third and fourth statements can execute concurrently

a ← 2
b ← 3
m ← (a + b) / 2
s ← (a² + b²) / 2
v ← s − m²
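The same computation in C, with OpenMP sections as one (assumed) way of expressing that statements 3 and 4 may run concurrently once a and b are set:

#include <stdio.h>

int main(void) {
    double a, b, m, s, v;

    a = 2;            /* statements 1 and 2 do not depend on each other */
    b = 3;

    /* Statements 3 and 4 both need a and b but not each other,
       so they may execute concurrently.                          */
    #pragma omp parallel sections
    {
        #pragma omp section
        m = (a + b) / 2;
        #pragma omp section
        s = (a * a + b * b) / 2;
    }

    v = s - m * m;    /* statement 5 needs both m and s */
    printf("v = %g\n", v);   /* 6.5 - 6.25 = 0.25 */
    return 0;
}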

Page 8: Day 2

Pipelining

• Divide a process into stages

• Produce several items simultaneously
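A rough C sketch of the idea, using a made-up three-stage pipeline (double, add one, accumulate). It runs sequentially here, but at every clock step each stage works on a different item, so the three stages could execute concurrently once the pipeline fills:

#include <stdio.h>

#define N 8          /* items in the stream      */
#define STAGES 3     /* stages in the pipeline   */

int main(void) {
    int in[N], s1[N], s2[N], sum = 0;
    for (int i = 0; i < N; i++) in[i] = i;

    /* At clock step t, stage s works on item t - s. */
    for (int t = 0; t < N + STAGES - 1; t++) {
        if (t - 2 >= 0 && t - 2 < N) sum += s2[t - 2];           /* stage 3 */
        if (t - 1 >= 0 && t - 1 < N) s2[t - 1] = s1[t - 1] + 1;  /* stage 2 */
        if (t < N)                   s1[t] = 2 * in[t];          /* stage 1 */
    }
    printf("sum = %d\n", sum);   /* sum of 2*i + 1 for i = 0..7 is 64 */
    return 0;
}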

Page 9: Day 2

Data Clustering

• Data mining = looking for meaningful patterns in large data sets

• Data clustering = organizing a data set into clusters of “similar” items

• Data clustering can speed retrieval of related items

Page 10: Day 2

Document Vectors

[Diagram: document vectors plotted against the terms "Moon" and "Rocket"; documents shown: Alice in Wonderland, A Biography of Jules Verne, The Geology of Moon Rocks, The Story of Apollo 11]

Page 11: Day 2

Document Clustering

Page 12: Day 2

Clustering Algorithm

• Compute document vectors
• Choose initial cluster centers
• Repeat
– Compute performance function
– Adjust centers
• Until function value converges or max iterations have elapsed
• Output cluster centers

Page 13: Day 2

Data Parallelism Opportunities

• Operation being applied to a data set

• Examples
– Generating document vectors
– Finding closest center to each vector (sketched below)
– Picking initial values of cluster centers
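A sketch of the "closest center" step as a data-parallel loop in C: every document is processed independently, so the outer loop can be divided among processors (OpenMP shown as one option). The sizes and names here are hypothetical, and the distance is squared Euclidean:

#include <float.h>

#define DIM        2     /* hypothetical vector length       */
#define N_DOCS  1000     /* hypothetical number of documents */
#define N_CENTERS  8     /* hypothetical number of clusters  */

/* For each document vector, record the index of the nearest center.
   Each document is handled independently: a data-parallel loop.     */
void closest_centers(const double doc[N_DOCS][DIM],
                     const double center[N_CENTERS][DIM],
                     int nearest[N_DOCS])
{
    #pragma omp parallel for
    for (int i = 0; i < N_DOCS; i++) {
        double best = DBL_MAX;
        for (int c = 0; c < N_CENTERS; c++) {
            double d = 0.0;
            for (int k = 0; k < DIM; k++) {
                double diff = doc[i][k] - center[c][k];
                d += diff * diff;
            }
            if (d < best) { best = d; nearest[i] = c; }
        }
    }
}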

Page 14: Day 2

Functional Parallelism Opportunities

• Draw data dependence diagram

• Look for sets of nodes such that there are no paths from one node to another

Page 15: Day 2

Data Dependence Diagram

Build document vectors

Compute function value

Choose cluster centers

Adjust cluster centers Output cluster centers

Page 16: Day 2

Programming Parallel Computers

• Extend compilers: translate sequential programs into parallel programs

• Extend languages: add parallel operations

• Add parallel language layer on top of sequential language

• Define totally new parallel language and compiler system

Page 17: Day 2

Strategy 1: Extend Compilers

• Parallelizing compiler
– Detect parallelism in sequential program
– Produce parallel executable program

• Focus on making Fortran programs parallel

Page 18: Day 2

Extend Compilers (cont.)

• Advantages
– Can leverage millions of lines of existing serial programs
– Saves time and labor
– Requires no retraining of programmers
– Sequential programming easier than parallel programming

Page 19: Day 2

Extend Compilers (cont.)

• Disadvantages
– Parallelism may be irretrievably lost when programs are written in sequential languages
– Performance of parallelizing compilers on a broad range of applications is still an open question

Page 20: Day 2

Extend Language

• Add functions to a sequential language
– Create and terminate processes
– Synchronize processes
– Allow processes to communicate

Page 21: Day 2

Extend Language (cont.)

• Advantages
– Easiest, quickest, and least expensive
– Allows existing compiler technology to be leveraged
– New libraries can be ready soon after new parallel computers are available

Page 22: Day 2

Extend Language (cont.)

• Disadvantages
– Lack of compiler support to catch errors
– Easy to write programs that are difficult to debug

Page 23: Day 2

Add a Parallel Programming Layer

• Lower layer
– Core of computation
– Process manipulates its portion of data to produce its portion of result
• Upper layer
– Creation and synchronization of processes
– Partitioning of data among processes
• A few research prototypes have been built based on these principles

Page 24: Day 2

Create a Parallel Language

• Develop a parallel language "from scratch"
– occam is an example
• Add parallel constructs to an existing language
– Fortran 90
– High Performance Fortran
– C*

Page 25: Day 2

New Parallel Languages (cont.)

• Advantages
– Allows programmer to communicate parallelism to compiler
– Improves probability that executable will achieve high performance
• Disadvantages
– Requires development of new compilers
– New languages may not become standards
– Programmer resistance

Page 26: Day 2

Current Status

• Low-level approach is most popular
– Augment existing language with low-level parallel constructs
– MPI and OpenMP are examples
• Advantages of low-level approach
– Efficiency
– Portability
• Disadvantage: more difficult to program and debug

Page 27: Day 2

Architectures

• Interconnection networks

• Processor arrays (SIMD/data parallel)

• Multiprocessors (shared memory)

• Multicomputers (distributed memory)

• Flynn’s taxonomy

Page 28: Day 2

Interconnection Networks

• Uses of interconnection networks
– Connect processors to shared memory
– Connect processors to each other
• Interconnection media types
– Shared medium
– Switched medium

Page 29: Day 2

Shared versus Switched Media

Page 30: Day 2

Shared Medium

• Allows only one message at a time

• Messages are broadcast

• Each processor “listens” to every message

• Arbitration is decentralized

• Collisions require resending of messages

• Ethernet is an example

Page 31: Day 2

Switched Medium

• Supports point-to-point messages between pairs of processors

• Each processor has its own path to switch
• Advantages over shared media
– Allows multiple messages to be sent simultaneously
– Allows scaling of network to accommodate increase in processors

Page 32: Day 2

Switch Network Topologies

• View switched network as a graph
– Vertices = processors or switches
– Edges = communication paths
• Two kinds of topologies
– Direct
– Indirect

Page 33: Day 2

Direct Topology

• Ratio of switch nodes to processor nodes is 1:1

• Every switch node is connected to
– 1 processor node
– At least 1 other switch node

Page 34: Day 2

Indirect Topology

• Ratio of switch nodes to processor nodes is greater than 1:1

• Some switches simply connect other switches

Page 35: Day 2

Evaluating Switch Topologies

• Diameter

• Bisection width

• Number of edges / node

• Constant edge length? (yes/no)

Page 36: Day 2

2-D Mesh Network

• Direct topology

• Switches arranged into a 2-D lattice

• Communication allowed only between neighboring switches

• Variants allow wraparound connections between switches on edge of mesh

Page 37: Day 2

2-D Meshes

Page 38: Day 2

Vector Computers

• Vector computer: instruction set includes operations on vectors as well as scalars

• Two ways to implement vector computers
– Pipelined vector processor: streams data through pipelined arithmetic units
– Processor array: many identical, synchronized arithmetic processing elements

Page 39: Day 2

Why Processor Arrays?

• Historically, high cost of a control unit

• Scientific applications have data parallelism

Page 40: Day 2

Processor Array

Page 41: Day 2

Data/instruction Storage

• Front-end computer
– Program
– Data manipulated sequentially
• Processor array
– Data manipulated in parallel

Page 42: Day 2

Processor Array Performance

• Performance: work done per time unit

• Performance of processor array
– Speed of processing elements
– Utilization of processing elements

Page 43: Day 2

Performance Example 1

• 1024 processors

• Each adds a pair of integers in 1 µsec

• What is performance when adding two 1024-element vectors (one per processor)?

Performance = 1024 operations / 1 µsec = 1.024 × 10⁹ ops/sec

Page 44: Day 2

Performance Example 2

• 512 processors

• Each adds two integers in 1 µsec

• Performance adding two vectors of length 600?

Performance = 600 operations / 2 µsec = 3 × 10⁸ ops/sec

(With 512 processors and 600 additions, some processors must add two pairs, so the vector add takes 2 µsec.)

Page 45: Day 2

2-D Processor Interconnection Network

Each VLSI chip has 16 processing elements

Page 46: Day 2

if (COND) then A else B

Page 47: Day 2

if (COND) then A else B

Page 48: Day 2

if (COND) then A else B

Page 49: Day 2

Processor Array Shortcomings

• Not all problems are data-parallel

• Speed drops for conditionally executed code

• Don’t adapt to multiple users well

• Do not scale down well to “starter” systems

• Rely on custom VLSI for processors

• Expense of control units has dropped

Page 50: Day 2

Multicomputer, aka Distributed Memory Machines

• Distributed memory multiple-CPU computer
• Same address on different processors refers to different physical memory locations
• Processors interact through message passing
• Commercial multicomputers
• Commodity clusters

Page 51: Day 2

Asymmetrical Multicomputer

Page 52: Day 2

Asymmetrical MC Advantages

• Back-end processors dedicated to parallel computations → easier to understand, model, and tune performance
• Only a simple back-end operating system needed → easy for a vendor to create

Page 53: Day 2

Asymmetrical MC Disadvantages

• Front-end computer is a single point of failure

• Single front-end computer limits scalability of system

• Primitive operating system in back-end processors makes debugging difficult

• Every application requires development of both front-end and back-end program

Page 54: Day 2

Symmetrical Multicomputer

Page 55: Day 2

Symmetrical MC Advantages

• Alleviate performance bottleneck caused by single front-end computer

• Better support for debugging

• Every processor executes same program

Page 56: Day 2

Symmetrical MC Disadvantages

• More difficult to maintain illusion of single “parallel computer”

• No simple way to balance program development workload among processors

• More difficult to achieve high performance when multiple processes on each processor

Page 57: Day 2

Commodity Cluster

• Co-located computers

• Dedicated to running parallel jobs

• No keyboards or displays

• Identical operating system

• Identical local disk images

• Administered as an entity

Page 58: Day 2

Network of Workstations

• Dispersed computers

• First priority: person at keyboard

• Parallel jobs run in background

• Different operating systems

• Different local images

• Checkpointing and restarting important

Page 59: Day 2

DM programming model

• Communicating sequential programs

• Disjoint address spaces

• Communicate by sending "messages"

• A message is an array of bytes
– Send(dest, char *buf, int len);
– receive(&dest, char *buf, int &len);
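MPI is the most common concrete form of this interface. A minimal sketch of the same send/receive idea (ranks, tag, and buffer size are illustrative; run with at least 2 processes):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    int rank;
    char buf[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* sender */
        strcpy(buf, "hello from rank 0");
        MPI_Send(buf, 64, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                /* receiver */
        MPI_Recv(buf, 64, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 got: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}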

Page 60: Day 2

Multiprocessors

• Multiprocessor: multiple-CPU computer with a shared memory

• Same address on two different CPUs refers to the same memory location

• Avoid three problems of processor arrays
– Can be built from commodity CPUs
– Naturally support multiple users
– Maintain efficiency in conditional code

Page 61: Day 2

Centralized Multiprocessor

• Straightforward extension of uniprocessor

• Add CPUs to bus

• All processors share same primary memory

• Memory access time same for all CPUs
– Uniform memory access (UMA) multiprocessor
– Symmetrical multiprocessor (SMP)

Page 62: Day 2

Centralized Multiprocessor

Page 63: Day 2

Private and Shared Data

• Private data: items used only by a single processor

• Shared data: values used by multiple processors

• In a multiprocessor, processors communicate via shared data values

Page 64: Day 2

Problems Associated with Shared Data

• Cache coherence
– Replicating data across multiple caches reduces contention
– How to ensure different processors have same value for same address?
• Synchronization
– Mutual exclusion
– Barrier
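A small C/OpenMP sketch of the two synchronization mechanisms named above, using a hypothetical shared counter:

#include <stdio.h>

int main(void) {
    int count = 0;                 /* shared data */

    #pragma omp parallel
    {
        /* Mutual exclusion: only one thread updates count at a time. */
        #pragma omp critical
        count++;

        /* Barrier: no thread continues until all have updated count. */
        #pragma omp barrier

        #pragma omp single
        printf("threads that checked in: %d\n", count);
    }
    return 0;
}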

Page 65: Day 2

Cache-coherence Problem

[Diagram: memory holds X = 7; CPU A and CPU B each have a cache, and neither cache holds a copy of X yet]

Page 66: Day 2

Cache-coherence Problem

[Diagram: CPU A reads X; its cache now holds 7, and memory still holds X = 7]

Page 67: Day 2

Cache-coherence Problem

[Diagram: CPU B also reads X; both caches now hold 7]

Page 68: Day 2

Cache-coherence Problem

[Diagram: CPU B writes X = 2; memory and CPU B's cache hold 2, but CPU A's cache still holds the stale value 7]

Page 69: Day 2

Write Invalidate Protocol

[Diagram: memory holds X = 7 and both caches hold 7; a cache control monitor snoops the bus]

Page 70: Day 2

Write Invalidate Protocol

[Diagram: CPU B broadcasts "intent to write X"; both caches still hold 7]

Page 71: Day 2

Write Invalidate Protocol

[Diagram: on seeing the intent to write, CPU A invalidates its copy of X; only CPU B's cache still holds 7]

Page 72: Day 2

Write Invalidate Protocol

[Diagram: CPU B writes X = 2; its cache holds the new value 2 and CPU A no longer has a copy]

Page 73: Day 2

Distributed Multiprocessor

• Distribute primary memory among processors

• Increase aggregate memory bandwidth and lower average memory access time

• Allow greater number of processors

• Also called non-uniform memory access (NUMA) multiprocessor

Page 74: Day 2

Distributed Multiprocessor

Page 75: Day 2

Cache Coherence

• Some NUMA multiprocessors do not support it in hardware
– Only instructions, private data in cache
– Large memory access time variance
• Implementation more difficult
– No shared memory bus to "snoop"
– Directory-based protocol needed

Page 76: Day 2

Flynn’s Taxonomy

• Instruction stream
• Data stream
• Single vs. multiple
• Four combinations
– SISD
– SIMD
– MISD
– MIMD

Page 77: Day 2

SISD

• Single Instruction, Single Data

• Single-CPU systems

• Note: co-processors don't count
– Functional
– I/O

• Example: PCs

Page 78: Day 2

SIMD

• Single Instruction, Multiple Data

• Two architectures fit this category
– Pipelined vector processor (e.g., Cray-1)
– Processor array (e.g., Connection Machine)

Page 79: Day 2

MISD

• Multiple Instruction, Single Data

• Example: systolic array

Page 80: Day 2

MIMD

• Multiple Instruction, Multiple Data

• Multiple-CPU computers
– Multiprocessors
– Multicomputers

Page 81: Day 2

Summary

• Commercial parallel computers appeared in the 1980s

• Multiple-CPU computers now dominate

• Small-scale: Centralized multiprocessors

• Large-scale: Distributed memory architectures (multiprocessors or multicomputers)

Page 82: Day 2

Programming the Beast

• Task/channel model

• Algorithm design methodology

• Case studies

Page 83: Day 2

Task/Channel Model

• Parallel computation = set of tasks

• Task
– Program
– Local memory
– Collection of I/O ports
• Tasks interact by sending messages through channels

Page 84: Day 2

Task/Channel Model

[Diagram: tasks connected by directed channels]

Page 85: Day 2

Foster’s Design Methodology

• Partitioning

• Communication

• Agglomeration

• Mapping

Page 86: Day 2

Foster’s Methodology

[Diagram: Problem → Partitioning → Communication → Agglomeration → Mapping]

Page 87: Day 2

Partitioning

• Dividing computation and data into pieces
• Domain decomposition
– Divide data into pieces
– Determine how to associate computations with the data
• Functional decomposition
– Divide computation into pieces
– Determine how to associate data with the computations

Page 88: Day 2

Example Domain Decompositions

Page 89: Day 2

Example Functional Decomposition

Page 90: Day 2

Partitioning Checklist

• At least 10x more primitive tasks than processors in target computer

• Minimize redundant computations and redundant data storage

• Primitive tasks roughly the same size

• Number of tasks an increasing function of problem size

Page 91: Day 2

Communication

• Determine values passed among tasks
• Local communication
– Task needs values from a small number of other tasks
– Create channels illustrating data flow
• Global communication
– Significant number of tasks contribute data to perform a computation
– Don't create channels for them early in design

Page 92: Day 2

Communication Checklist

• Communication operations balanced among tasks

• Each task communicates with only small group of neighbors

• Tasks can perform communications concurrently

• Tasks can perform computations concurrently

Page 93: Day 2

Agglomeration

• Grouping tasks into larger tasks

• Goals
– Improve performance
– Maintain scalability of program
– Simplify programming
• In MPI programming, goal often to create one agglomerated task per processor

Page 94: Day 2

Agglomeration Can Improve Performance

• Eliminate communication between primitive tasks agglomerated into consolidated task

• Combine groups of sending and receiving tasks

Page 95: Day 2

Agglomeration Checklist

• Locality of parallel algorithm has increased
• Replicated computations take less time than the communications they replace
• Data replication doesn't affect scalability
• Agglomerated tasks have similar computation and communication costs
• Number of tasks increases with problem size
• Number of tasks suitable for likely target systems
• Tradeoff between agglomeration and code modification costs is reasonable

Page 96: Day 2

Mapping

• Process of assigning tasks to processors
• Centralized multiprocessor: mapping done by operating system
• Distributed memory system: mapping done by user
• Conflicting goals of mapping
– Maximize processor utilization
– Minimize interprocessor communication

Page 97: Day 2

Mapping Example

Page 98: Day 2

Optimal Mapping

• Finding optimal mapping is NP-hard

• Must rely on heuristics

Page 99: Day 2

Mapping Decision Tree

• Static number of tasks
  – Structured communication
    • Constant computation time per task: agglomerate tasks to minimize communication; create one task per processor
    • Variable computation time per task: cyclically map tasks to processors
  – Unstructured communication: use a static load balancing algorithm
• Dynamic number of tasks

Page 100: Day 2

Mapping Strategy

• Static number of tasks
• Dynamic number of tasks
  – Frequent communication between tasks: use a dynamic load balancing algorithm
  – Many short-lived tasks: use a run-time task-scheduling algorithm

Page 101: Day 2

Mapping Checklist

• Considered designs based on one task per processor and multiple tasks per processor
• Evaluated static and dynamic task allocation
• If dynamic task allocation chosen, task allocator is not a bottleneck to performance
• If static task allocation chosen, ratio of tasks to processors is at least 10:1

Page 102: Day 2

Case Studies

• Boundary value problem

• Finding the maximum

• The n-body problem

• Adding data input

Page 103: Day 2

Boundary Value Problem

[Diagram: a rod with both ends in ice water and insulation along its length]

Page 104: Day 2

Rod Cools as Time Progresses

Page 105: Day 2

Finite Difference Approximation

Page 106: Day 2

Partitioning

• One data item per grid point

• Associate one primitive task with each grid point

• Two-dimensional domain decomposition

Page 107: Day 2

Communication

• Identify communication pattern between primitive tasks

• Each interior primitive task has three incoming and three outgoing channels

Page 108: Day 2

Agglomeration and Mapping

Agglomeration

Page 109: Day 2

Sequential execution time

• χ – time to update element

• n – number of elements

• m – number of iterations

• Sequential execution time: m(n − 1)χ
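A minimal sequential sketch of the computation being timed: m sweeps over the rod, each updating the n − 1 interior points, for m(n − 1) element updates in total, each costing χ. The particular update rule and the constant R are an assumed standard explicit finite-difference step; the slide's own formula is not reproduced in this transcript.

#include <stdio.h>

#define N 10        /* number of rod segments (n)                    */
#define M 100       /* number of time steps (m)                      */
#define R 0.25      /* assumed constant k*dt/h^2 in the update rule  */

int main(void) {
    double u[N + 1], unew[N + 1];

    for (int i = 0; i <= N; i++) u[i] = 100.0;  /* warm rod interior */
    u[0] = u[N] = 0.0;                          /* ice-water ends    */

    /* m iterations, each updating the n - 1 interior points. */
    for (int t = 0; t < M; t++) {
        for (int i = 1; i < N; i++)
            unew[i] = u[i] + R * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
        unew[0] = u[0]; unew[N] = u[N];
        for (int i = 0; i <= N; i++) u[i] = unew[i];
    }
    printf("midpoint after %d steps: %g\n", M, u[N / 2]);
    return 0;
}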

Page 110: Day 2

Parallel Execution Time

• p – number of processors
• λ – message latency

• Parallel execution time: m(χ⌈(n − 1)/p⌉ + 2λ)

Page 111: Day 2

Reduction

• Given an associative operator ⊕
• Compute a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an−1

• Examples
– Add
– Multiply
– And, Or
– Maximum, Minimum
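In MPI, a reduction with any of these operators is a single collective call. The sketch below sums one integer per process onto rank 0; the contributed values are illustrative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1;     /* each process contributes one value */
    int total = 0;

    /* Combine all contributions with the associative operator +. */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %d\n", size, total);

    MPI_Finalize();
    return 0;
}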

Page 112: Day 2

Parallel Reduction Evolution

Page 113: Day 2

Parallel Reduction Evolution

Page 114: Day 2

Parallel Reduction Evolution

Page 115: Day 2

Binomial Trees

Subgraph of hypercube

Page 116: Day 2

Finding Global Sum

4 2 0 7

-3 5 -6 -3

8 1 2 3

-4 4 6 -1

Page 117: Day 2

Finding Global Sum

1 7 -6 4

4 5 8 2

Page 118: Day 2

Finding Global Sum

8 -2

9 10

Page 119: Day 2

Finding Global Sum

17 8

Page 120: Day 2

Finding Global Sum

25

Binomial Tree

Page 121: Day 2

Agglomeration

Page 122: Day 2

Agglomeration

[Diagram: after agglomeration, each of the four tasks computes a local sum before the partial sums are combined]

Page 123: Day 2

The n-body Problem

Page 124: Day 2

The n-body Problem

Page 125: Day 2

Partitioning

• Domain partitioning

• Assume one task per particle

• Task has particle’s position, velocity vector

• Iteration
– Get positions of all other particles
– Compute new position, velocity

Page 126: Day 2

Gather

Page 127: Day 2

All-gather
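In MPI this step maps onto MPI_Allgather: every process contributes its own particle data and receives everyone else's. A minimal sketch, assuming for brevity one double per particle position and an illustrative buffer of 64 entries:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double my_pos = (double) rank;    /* this task's particle position */
    double all_pos[64];               /* assumes p <= 64               */

    /* After the call, every process holds the positions of all particles. */
    MPI_Allgather(&my_pos, 1, MPI_DOUBLE,
                  all_pos, 1, MPI_DOUBLE, MPI_COMM_WORLD);

    printf("rank %d sees position of particle 0: %g\n", rank, all_pos[0]);
    MPI_Finalize();
    return 0;
}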

Page 128: Day 2

Complete Graph for All-gather

Page 129: Day 2

Hypercube for All-gather

Page 130: Day 2

Communication Time

p

pnp

p

np

i

)1(

log2

log

1

1-i

Hypercube

Complete graph

p

pnp

pnp

)1(

)1()/

)(1(

Page 131: Day 2

Adding Data Input

Page 132: Day 2

Scatter
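In MPI, distributing the input among processes is MPI_Scatter. A minimal sketch; the chunk size, buffer size, and values are illustrative:

#include <mpi.h>
#include <stdio.h>

#define CHUNK 2   /* illustrative: items handed to each process */

int main(int argc, char *argv[]) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int input[64];                 /* assumes p * CHUNK <= 64       */
    int mine[CHUNK];

    if (rank == 0)                 /* only the root reads the input */
        for (int i = 0; i < p * CHUNK; i++) input[i] = i + 1;

    /* Hand CHUNK items to every process, root included. */
    MPI_Scatter(input, CHUNK, MPI_INT,
                mine, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received %d..%d\n", rank, mine[0], mine[CHUNK - 1]);
    MPI_Finalize();
    return 0;
}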

Page 133: Day 2

Scatter in log p Steps

[Diagram: the root starts with items 1–8 and sends half of its remaining data at each step: 12345678 → 1234 | 5678 → 12 | 34 | 56 | 78]

Page 134: Day 2

Summary: Task/channel Model

• Parallel computation– Set of tasks– Interactions through channels

• Good designs– Maximize local computations– Minimize communications– Scale up

Page 135: Day 2

Summary: Design Steps

• Partition computation

• Agglomerate tasks

• Map tasks to processors

• Goals– Maximize processor utilization– Minimize inter-processor communication

Page 136: Day 2

Summary: Fundamental Algorithms

• Reduction

• Gather and scatter

• All-gather

Page 137: Day 2

High Throughput Computing

• Easy problems – formerly known as “embarrassingly parallel” – now known as “pleasingly parallel”

• Basic idea – “Gee – I have a whole bunch of jobs (single run of a program) that I need to do, why not run them concurrently rather than sequentially”

• Sometimes called “bag of tasks” or parameter sweep problems

Page 138: Day 2

Bag-of-tasks

Page 139: Day 2

Examples

• A large number of proteins – each represented by a different file – to "dock" with a target protein
– For all files x, execute f(x, y)
• Exploring a parameter space in n dimensions
– Uniform
– Non-uniform
• Monte Carlo simulations

Page 140: Day 2

Tools

• Most common tool is a queuing system – sometimes called a load management system, or a local resource manager

• PBS, LSF, and SGE are the three most common. Condor is also often used.

• They all have the same basic functions; we'll use PBS as an exemplar.

• Script languages (bash, Perl, etc.)

Page 141: Day 2

PBS

• qsub options script-file
– Submit the script to run
– Options can specify number of processors, other required resources (memory, etc.)
– Returns the job ID (a string)

Page 142: Day 2
Page 143: Day 2
Page 144: Day 2

Other PBS

• qstat – give the status of jobs submitted to the queue

• qdel – delete a job from the queue

Page 145: Day 2

Blasting a set of jobs

Page 146: Day 2

Issues

• Overhead per job is substantial
– Don't want to run millisecond jobs
– May need to "bundle them up"
• May not be enough jobs to saturate resources
– May need to break up jobs
• I/O system may become saturated
– Copy large files to /tmp, check for existence in your shell script, copy if not there
• May be more jobs than the queuing system can handle (many start to break down at several thousand jobs)
• Jobs may fail for no good reason
– Develop scripts to check for output and re-submit up to k jobs

Page 147: Day 2

Homework

1. Submit a simple job to the queue that echoes the host name; redirect output to a file of your choice.

2. Via a script, submit 100 "hostname" jobs to the queue. Output should be written to "output.X", where X is the output number.

3. For each file in a rooted directory tree, run "wc" to count the words. Maintain the results in a "shadow" directory tree. Your script should be able to detect results that have already been computed.