Page 1: Day 2

Day 2

Page 2: Day 2

Agenda

• Parallelism basics

• Parallel machines

• Parallelism again

• High Throughput Computing
– Finding the right grain size

Page 3: Day 2

One thing to remember

Easy Hard

Page 4: Day 2

Seeking Concurrency

• Data dependence graphs

• Data parallelism

• Functional parallelism

• Pipelining

Page 5: Day 2

Data Dependence Graph

• Directed graph

• Vertices = tasks

• Edges = dependences

Page 6: Day 2

Data Parallelism

• Independent tasks apply same operation to different elements of a data set

• Okay to perform operations concurrently

for i ← 0 to 99 do
    a[i] ← b[i] + c[i]
endfor
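No iteration of this loop depends on any other, so the work can be split across processors. A minimal C sketch, using OpenMP as one (assumed) way to express it; the array contents are illustrative:

#include <stdio.h>

#define N 100

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

    /* Each iteration is independent, so the iterations can be
       divided among threads; OpenMP is one low-level way to say so. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[99] = %g\n", a[99]);   /* 99 + 198 = 297 */
    return 0;
}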

Page 7: Day 2

Functional Parallelism

• Independent tasks apply different operations to different data elements

• First and second statements can execute concurrently
• Third and fourth statements can execute concurrently

a ← 2
b ← 3
m ← (a + b) / 2
s ← (a² + b²) / 2
v ← s − m²
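The same computation in C, with OpenMP sections as one (assumed) way of expressing that statements 3 and 4 may run concurrently once a and b are set:

#include <stdio.h>

int main(void) {
    double a, b, m, s, v;

    a = 2;            /* statements 1 and 2 do not depend on each other */
    b = 3;

    /* Statements 3 and 4 both need a and b but not each other,
       so they may execute concurrently.                          */
    #pragma omp parallel sections
    {
        #pragma omp section
        m = (a + b) / 2;
        #pragma omp section
        s = (a * a + b * b) / 2;
    }

    v = s - m * m;    /* statement 5 needs both m and s */
    printf("v = %g\n", v);   /* 6.5 - 6.25 = 0.25 */
    return 0;
}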

Page 8: Day 2

Pipelining

• Divide a process into stages

• Produce several items simultaneously
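A rough C sketch of the idea, using a made-up three-stage pipeline (double, add one, accumulate). It runs sequentially here, but at every clock step each stage works on a different item, so the three stages could execute concurrently once the pipeline fills:

#include <stdio.h>

#define N 8          /* items in the stream      */
#define STAGES 3     /* stages in the pipeline   */

int main(void) {
    int in[N], s1[N], s2[N], sum = 0;
    for (int i = 0; i < N; i++) in[i] = i;

    /* At clock step t, stage s works on item t - s. */
    for (int t = 0; t < N + STAGES - 1; t++) {
        if (t - 2 >= 0 && t - 2 < N) sum += s2[t - 2];           /* stage 3 */
        if (t - 1 >= 0 && t - 1 < N) s2[t - 1] = s1[t - 1] + 1;  /* stage 2 */
        if (t < N)                   s1[t] = 2 * in[t];          /* stage 1 */
    }
    printf("sum = %d\n", sum);   /* sum of 2*i + 1 for i = 0..7 is 64 */
    return 0;
}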

Page 9: Day 2

Data Clustering

• Data mining = looking for meaningful patterns in large data sets

• Data clustering = organizing a data set into clusters of “similar” items

• Data clustering can speed retrieval of related items

Page 10: Day 2

Document Vectors

[Diagram: document vectors plotted against the terms "Moon" and "Rocket"; documents shown: Alice in Wonderland, A Biography of Jules Verne, The Geology of Moon Rocks, The Story of Apollo 11]

Page 11: Day 2

Document Clustering

Page 12: Day 2

Clustering Algorithm

• Compute document vectors
• Choose initial cluster centers
• Repeat
– Compute performance function
– Adjust centers
• Until function value converges or max iterations have elapsed
• Output cluster centers

Page 13: Day 2

Data Parallelism Opportunities

• Operation being applied to a data set

• Examples
– Generating document vectors
– Finding closest center to each vector (sketched below)
– Picking initial values of cluster centers
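A sketch of the "closest center" step as a data-parallel loop in C: every document is processed independently, so the outer loop can be divided among processors (OpenMP shown as one option). The sizes and names here are hypothetical, and the distance is squared Euclidean:

#include <float.h>

#define DIM        2     /* hypothetical vector length       */
#define N_DOCS  1000     /* hypothetical number of documents */
#define N_CENTERS  8     /* hypothetical number of clusters  */

/* For each document vector, record the index of the nearest center.
   Each document is handled independently: a data-parallel loop.     */
void closest_centers(const double doc[N_DOCS][DIM],
                     const double center[N_CENTERS][DIM],
                     int nearest[N_DOCS])
{
    #pragma omp parallel for
    for (int i = 0; i < N_DOCS; i++) {
        double best = DBL_MAX;
        for (int c = 0; c < N_CENTERS; c++) {
            double d = 0.0;
            for (int k = 0; k < DIM; k++) {
                double diff = doc[i][k] - center[c][k];
                d += diff * diff;
            }
            if (d < best) { best = d; nearest[i] = c; }
        }
    }
}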

Page 14: Day 2

Functional Parallelism Opportunities

• Draw data dependence diagram

• Look for sets of nodes such that there are no paths from one node to another

Page 15: Day 2

Data Dependence Diagram

Build document vectors

Compute function value

Choose cluster centers

Adjust cluster centers Output cluster centers

Page 16: Day 2

Programming Parallel Computers

• Extend compilers: translate sequential programs into parallel programs

• Extend languages: add parallel operations

• Add parallel language layer on top of sequential language

• Define totally new parallel language and compiler system

Page 17: Day 2

Strategy 1: Extend Compilers

• Parallelizing compiler
– Detect parallelism in sequential program
– Produce parallel executable program

• Focus on making Fortran programs parallel

Page 18: Day 2

Extend Compilers (cont.)

• Advantages
– Can leverage millions of lines of existing serial programs
– Saves time and labor
– Requires no retraining of programmers
– Sequential programming easier than parallel programming

Page 19: Day 2

Extend Compilers (cont.)

• Disadvantages
– Parallelism may be irretrievably lost when programs are written in sequential languages
– Performance of parallelizing compilers on a broad range of applications is still an open question

Page 20: Day 2

Extend Language

• Add functions to a sequential language
– Create and terminate processes
– Synchronize processes
– Allow processes to communicate

Page 21: Day 2

Extend Language (cont.)

• Advantages
– Easiest, quickest, and least expensive
– Allows existing compiler technology to be leveraged
– New libraries can be ready soon after new parallel computers are available

Page 22: Day 2

Extend Language (cont.)

• Disadvantages
– Lack of compiler support to catch errors
– Easy to write programs that are difficult to debug

Page 23: Day 2

Add a Parallel Programming Layer

• Lower layer
– Core of computation
– Process manipulates its portion of data to produce its portion of result
• Upper layer
– Creation and synchronization of processes
– Partitioning of data among processes
• A few research prototypes have been built based on these principles

Page 24: Day 2

Create a Parallel Language

• Develop a parallel language "from scratch"
– occam is an example
• Add parallel constructs to an existing language
– Fortran 90
– High Performance Fortran
– C*

Page 25: Day 2

New Parallel Languages (cont.)

• Advantages
– Allows programmer to communicate parallelism to compiler
– Improves probability that executable will achieve high performance
• Disadvantages
– Requires development of new compilers
– New languages may not become standards
– Programmer resistance

Page 26: Day 2

Current Status

• Low-level approach is most popular
– Augment existing language with low-level parallel constructs
– MPI and OpenMP are examples
• Advantages of low-level approach
– Efficiency
– Portability
• Disadvantage: more difficult to program and debug

Page 27: Day 2

Architectures

• Interconnection networks

• Processor arrays (SIMD/data parallel)

• Multiprocessors (shared memory)

• Multicomputers (distributed memory)

• Flynn’s taxonomy

Page 28: Day 2

Interconnection Networks

• Uses of interconnection networks
– Connect processors to shared memory
– Connect processors to each other
• Interconnection media types
– Shared medium
– Switched medium

Page 29: Day 2

Shared versus Switched Media

Page 30: Day 2

Shared Medium

• Allows only one message at a time

• Messages are broadcast

• Each processor “listens” to every message

• Arbitration is decentralized

• Collisions require resending of messages

• Ethernet is an example

Page 31: Day 2

Switched Medium

• Supports point-to-point messages between pairs of processors

• Each processor has its own path to switch
• Advantages over shared media
– Allows multiple messages to be sent simultaneously
– Allows scaling of network to accommodate increase in processors

Page 32: Day 2

Switch Network Topologies

• View switched network as a graph
– Vertices = processors or switches
– Edges = communication paths
• Two kinds of topologies
– Direct
– Indirect

Page 33: Day 2

Direct Topology

• Ratio of switch nodes to processor nodes is 1:1

• Every switch node is connected to
– 1 processor node
– At least 1 other switch node

Page 34: Day 2

Indirect Topology

• Ratio of switch nodes to processor nodes is greater than 1:1

• Some switches simply connect other switches

Page 35: Day 2

Evaluating Switch Topologies

• Diameter

• Bisection width

• Number of edges / node

• Constant edge length? (yes/no)

Page 36: Day 2

2-D Mesh Network

• Direct topology

• Switches arranged into a 2-D lattice

• Communication allowed only between neighboring switches

• Variants allow wraparound connections between switches on edge of mesh

Page 37: Day 2

2-D Meshes

Page 38: Day 2

Vector Computers

• Vector computer: instruction set includes operations on vectors as well as scalars

• Two ways to implement vector computers
– Pipelined vector processor: streams data through pipelined arithmetic units
– Processor array: many identical, synchronized arithmetic processing elements

Page 39: Day 2

Why Processor Arrays?

• Historically, high cost of a control unit

• Scientific applications have data parallelism

Page 40: Day 2

Processor Array

Page 41: Day 2

Data/instruction Storage

• Front-end computer
– Program
– Data manipulated sequentially
• Processor array
– Data manipulated in parallel

Page 42: Day 2

Processor Array Performance

• Performance: work done per time unit

• Performance of processor array
– Speed of processing elements
– Utilization of processing elements

Page 43: Day 2

Performance Example 1

• 1024 processors

• Each adds a pair of integers in 1 µsec

• What is performance when adding two 1024-element vectors (one per processor)?

Performance = 1024 operations / 1 µsec = 1.024 × 10⁹ ops/sec

Page 44: Day 2

Performance Example 2

• 512 processors

• Each adds two integers in 1 µsec

• Performance adding two vectors of length 600?

Performance = 600 operations / 2 µsec = 3 × 10⁸ ops/sec

(With 512 processors and 600 additions, some processors must add two pairs, so the vector add takes 2 µsec.)

Page 45: Day 2

2-D Processor Interconnection Network

Each VLSI chip has 16 processing elements

Page 46: Day 2

if (COND) then A else B

Page 47: Day 2

if (COND) then A else B

Page 48: Day 2

if (COND) then A else B

Page 49: Day 2

Processor Array Shortcomings

• Not all problems are data-parallel

• Speed drops for conditionally executed code

• Don’t adapt to multiple users well

• Do not scale down well to “starter” systems

• Rely on custom VLSI for processors

• Expense of control units has dropped

Page 50: Day 2

Multicomputer, aka Distributed Memory Machines

• Distributed memory multiple-CPU computer
• Same address on different processors refers to different physical memory locations
• Processors interact through message passing
• Commercial multicomputers
• Commodity clusters

Page 51: Day 2

Asymmetrical Multicomputer

Page 52: Day 2

Asymmetrical MC Advantages

• Back-end processors dedicated to parallel computations → easier to understand, model, and tune performance
• Only a simple back-end operating system needed → easy for a vendor to create

Page 53: Day 2

Asymmetrical MC Disadvantages

• Front-end computer is a single point of failure

• Single front-end computer limits scalability of system

• Primitive operating system in back-end processors makes debugging difficult

• Every application requires development of both front-end and back-end program

Page 54: Day 2

Symmetrical Multicomputer

Page 55: Day 2

Symmetrical MC Advantages

• Alleviate performance bottleneck caused by single front-end computer

• Better support for debugging

• Every processor executes same program

Page 56: Day 2

Symmetrical MC Disadvantages

• More difficult to maintain illusion of single “parallel computer”

• No simple way to balance program development workload among processors

• More difficult to achieve high performance when multiple processes on each processor

Page 57: Day 2

Commodity Cluster

• Co-located computers

• Dedicated to running parallel jobs

• No keyboards or displays

• Identical operating system

• Identical local disk images

• Administered as an entity

Page 58: Day 2

Network of Workstations

• Dispersed computers

• First priority: person at keyboard

• Parallel jobs run in background

• Different operating systems

• Different local images

• Checkpointing and restarting important

Page 59: Day 2

DM programming model

• Communicating sequential programs

• Disjoint address spaces

• Communicate by sending "messages"

• A message is an array of bytes
– Send(dest, char *buf, int len);
– receive(&dest, char *buf, int &len);
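MPI is the most common concrete form of this interface. A minimal sketch of the same send/receive idea (ranks, tag, and buffer size are illustrative; run with at least 2 processes):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    int rank;
    char buf[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* sender */
        strcpy(buf, "hello from rank 0");
        MPI_Send(buf, 64, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                /* receiver */
        MPI_Recv(buf, 64, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 got: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}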

Page 60: Day 2

Multiprocessors

• Multiprocessor: multiple-CPU computer with a shared memory

• Same address on two different CPUs refers to the same memory location

• Avoid three problems of processor arrays
– Can be built from commodity CPUs
– Naturally support multiple users
– Maintain efficiency in conditional code

Page 61: Day 2

Centralized Multiprocessor

• Straightforward extension of uniprocessor

• Add CPUs to bus

• All processors share same primary memory

• Memory access time same for all CPUs
– Uniform memory access (UMA) multiprocessor
– Symmetrical multiprocessor (SMP)

Page 62: Day 2

Centralized Multiprocessor

Page 63: Day 2

Private and Shared Data

• Private data: items used only by a single processor

• Shared data: values used by multiple processors

• In a multiprocessor, processors communicate via shared data values

Page 64: Day 2

Problems Associated with Shared Data

• Cache coherence
– Replicating data across multiple caches reduces contention
– How to ensure different processors have same value for same address?
• Synchronization
– Mutual exclusion
– Barrier
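A small C/OpenMP sketch of the two synchronization mechanisms named above, using a hypothetical shared counter:

#include <stdio.h>

int main(void) {
    int count = 0;                 /* shared data */

    #pragma omp parallel
    {
        /* Mutual exclusion: only one thread updates count at a time. */
        #pragma omp critical
        count++;

        /* Barrier: no thread continues until all have updated count. */
        #pragma omp barrier

        #pragma omp single
        printf("threads that checked in: %d\n", count);
    }
    return 0;
}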

Page 65: Day 2

Cache-coherence Problem

[Diagram: memory holds X = 7; CPU A and CPU B each have a cache, and neither cache holds a copy of X yet]

Page 66: Day 2

Cache-coherence Problem

[Diagram: CPU A reads X; its cache now holds 7, and memory still holds X = 7]

Page 67: Day 2

Cache-coherence Problem

[Diagram: CPU B also reads X; both caches now hold 7]

Page 68: Day 2

Cache-coherence Problem

[Diagram: CPU B writes X = 2; memory and CPU B's cache hold 2, but CPU A's cache still holds the stale value 7]

Page 69: Day 2

Write Invalidate Protocol

[Diagram: memory holds X = 7 and both caches hold 7; a cache control monitor snoops the bus]

Page 70: Day 2

Write Invalidate Protocol

[Diagram: CPU B broadcasts "intent to write X"; both caches still hold 7]

Page 71: Day 2

Write Invalidate Protocol

[Diagram: on seeing the intent to write, CPU A invalidates its copy of X; only CPU B's cache still holds 7]

Page 72: Day 2

Write Invalidate Protocol

[Diagram: CPU B writes X = 2; its cache holds the new value 2 and CPU A no longer has a copy]

Page 73: Day 2

Distributed Multiprocessor

• Distribute primary memory among processors

• Increase aggregate memory bandwidth and lower average memory access time

• Allow greater number of processors

• Also called non-uniform memory access (NUMA) multiprocessor

Page 74: Day 2

Distributed Multiprocessor

Page 75: Day 2

Cache Coherence

• Some NUMA multiprocessors do not support it in hardware
– Only instructions, private data in cache
– Large memory access time variance
• Implementation more difficult
– No shared memory bus to "snoop"
– Directory-based protocol needed

Page 76: Day 2

Flynn’s Taxonomy

• Instruction stream
• Data stream
• Single vs. multiple
• Four combinations
– SISD
– SIMD
– MISD
– MIMD

Page 77: Day 2

SISD

• Single Instruction, Single Data

• Single-CPU systems

• Note: co-processors don't count
– Functional
– I/O

• Example: PCs

Page 78: Day 2

SIMD

• Single Instruction, Multiple Data

• Two architectures fit this category
– Pipelined vector processor (e.g., Cray-1)
– Processor array (e.g., Connection Machine)

Page 79: Day 2

MISD

• Multiple Instruction, Single Data

• Example: systolic array

Page 80: Day 2

MIMD

• Multiple Instruction, Multiple Data

• Multiple-CPU computers
– Multiprocessors
– Multicomputers

Page 81: Day 2

Summary

• Commercial parallel computers appeared in the 1980s

• Multiple-CPU computers now dominate

• Small-scale: Centralized multiprocessors

• Large-scale: Distributed memory architectures (multiprocessors or multicomputers)

Page 82: Day 2

Programming the Beast

• Task/channel model

• Algorithm design methodology

• Case studies

Page 83: Day 2

Task/Channel Model

• Parallel computation = set of tasks

• Task
– Program
– Local memory
– Collection of I/O ports
• Tasks interact by sending messages through channels

Page 84: Day 2

Task/Channel Model

[Diagram: tasks connected by directed channels]

Page 85: Day 2

Foster’s Design Methodology

• Partitioning

• Communication

• Agglomeration

• Mapping

Page 86: Day 2

Foster’s Methodology

[Diagram: Problem → Partitioning → Communication → Agglomeration → Mapping]

Page 87: Day 2

Partitioning

• Dividing computation and data into pieces
• Domain decomposition
– Divide data into pieces
– Determine how to associate computations with the data
• Functional decomposition
– Divide computation into pieces
– Determine how to associate data with the computations

Page 88: Day 2

Example Domain Decompositions

Page 89: Day 2

Example Functional Decomposition

Page 90: Day 2

Partitioning Checklist

• At least 10x more primitive tasks than processors in target computer

• Minimize redundant computations and redundant data storage

• Primitive tasks roughly the same size

• Number of tasks an increasing function of problem size

Page 91: Day 2

Communication

• Determine values passed among tasks
• Local communication
– Task needs values from a small number of other tasks
– Create channels illustrating data flow
• Global communication
– Significant number of tasks contribute data to perform a computation
– Don't create channels for them early in design

Page 92: Day 2

Communication Checklist

• Communication operations balanced among tasks

• Each task communicates with only small group of neighbors

• Tasks can perform communications concurrently

• Tasks can perform computations concurrently

Page 93: Day 2

Agglomeration

• Grouping tasks into larger tasks

• Goals
– Improve performance
– Maintain scalability of program
– Simplify programming
• In MPI programming, goal often to create one agglomerated task per processor

Page 94: Day 2

Agglomeration Can Improve Performance

• Eliminate communication between primitive tasks agglomerated into consolidated task

• Combine groups of sending and receiving tasks

Page 95: Day 2

Agglomeration Checklist

• Locality of parallel algorithm has increased
• Replicated computations take less time than the communications they replace
• Data replication doesn't affect scalability
• Agglomerated tasks have similar computation and communication costs
• Number of tasks increases with problem size
• Number of tasks suitable for likely target systems
• Tradeoff between agglomeration and code modification costs is reasonable

Page 96: Day 2

Mapping

• Process of assigning tasks to processors
• Centralized multiprocessor: mapping done by operating system
• Distributed memory system: mapping done by user
• Conflicting goals of mapping
– Maximize processor utilization
– Minimize interprocessor communication

Page 97: Day 2

Mapping Example

Page 98: Day 2

Optimal Mapping

• Finding optimal mapping is NP-hard

• Must rely on heuristics

Page 99: Day 2

Mapping Decision Tree

• Static number of tasks
  – Structured communication
    • Constant computation time per task: agglomerate tasks to minimize communication; create one task per processor
    • Variable computation time per task: cyclically map tasks to processors
  – Unstructured communication: use a static load balancing algorithm
• Dynamic number of tasks

Page 100: Day 2

Mapping Strategy

• Static number of tasks
• Dynamic number of tasks
  – Frequent communication between tasks: use a dynamic load balancing algorithm
  – Many short-lived tasks: use a run-time task-scheduling algorithm

Page 101: Day 2

Mapping Checklist

• Considered designs based on one task per processor and multiple tasks per processor
• Evaluated static and dynamic task allocation
• If dynamic task allocation chosen, task allocator is not a bottleneck to performance
• If static task allocation chosen, ratio of tasks to processors is at least 10:1

Page 102: Day 2

Case Studies

• Boundary value problem

• Finding the maximum

• The n-body problem

• Adding data input

Page 103: Day 2

Boundary Value Problem

[Diagram: a rod with both ends in ice water and insulation along its length]

Page 104: Day 2

Rod Cools as Time Progresses

Page 105: Day 2

Finite Difference Approximation

Page 106: Day 2

Partitioning

• One data item per grid point

• Associate one primitive task with each grid point

• Two-dimensional domain decomposition

Page 107: Day 2

Communication

• Identify communication pattern between primitive tasks

• Each interior primitive task has three incoming and three outgoing channels

Page 108: Day 2

Agglomeration and Mapping

Agglomeration

Page 109: Day 2

Sequential execution time

• χ – time to update element

• n – number of elements

• m – number of iterations

• Sequential execution time: m(n − 1)χ
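A minimal sequential sketch of the computation being timed: m sweeps over the rod, each updating the n − 1 interior points, for m(n − 1) element updates in total, each costing χ. The particular update rule and the constant R are an assumed standard explicit finite-difference step; the slide's own formula is not reproduced in this transcript.

#include <stdio.h>

#define N 10        /* number of rod segments (n)                    */
#define M 100       /* number of time steps (m)                      */
#define R 0.25      /* assumed constant k*dt/h^2 in the update rule  */

int main(void) {
    double u[N + 1], unew[N + 1];

    for (int i = 0; i <= N; i++) u[i] = 100.0;  /* warm rod interior */
    u[0] = u[N] = 0.0;                          /* ice-water ends    */

    /* m iterations, each updating the n - 1 interior points. */
    for (int t = 0; t < M; t++) {
        for (int i = 1; i < N; i++)
            unew[i] = u[i] + R * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
        unew[0] = u[0]; unew[N] = u[N];
        for (int i = 0; i <= N; i++) u[i] = unew[i];
    }
    printf("midpoint after %d steps: %g\n", M, u[N / 2]);
    return 0;
}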

Page 110: Day 2

Parallel Execution Time

• p – number of processors
• λ – message latency

• Parallel execution time: m(χ⌈(n − 1)/p⌉ + 2λ)

Page 111: Day 2

Reduction

• Given an associative operator ⊕
• Compute a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an−1

• Examples
– Add
– Multiply
– And, Or
– Maximum, Minimum
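In MPI, a reduction with any of these operators is a single collective call. The sketch below sums one integer per process onto rank 0; the contributed values are illustrative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1;     /* each process contributes one value */
    int total = 0;

    /* Combine all contributions with the associative operator +. */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %d\n", size, total);

    MPI_Finalize();
    return 0;
}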

Page 112: Day 2

Parallel Reduction Evolution

Page 113: Day 2

Parallel Reduction Evolution

Page 114: Day 2

Parallel Reduction Evolution

Page 115: Day 2

Binomial Trees

Subgraph of hypercube

Page 116: Day 2

Finding Global Sum

4 2 0 7

-3 5 -6 -3

8 1 2 3

-4 4 6 -1

Page 117: Day 2

Finding Global Sum

1 7 -6 4

4 5 8 2

Page 118: Day 2

Finding Global Sum

8 -2

9 10

Page 119: Day 2

Finding Global Sum

17 8

Page 120: Day 2

Finding Global Sum

25

Binomial Tree

Page 121: Day 2

Agglomeration

Page 122: Day 2

Agglomeration

[Diagram: after agglomeration, each of the four tasks computes a local sum before the partial sums are combined]

Page 123: Day 2

The n-body Problem

Page 124: Day 2

The n-body Problem

Page 125: Day 2

Partitioning

• Domain partitioning

• Assume one task per particle

• Task has particle’s position, velocity vector

• Iteration
– Get positions of all other particles
– Compute new position, velocity

Page 126: Day 2

Gather

Page 127: Day 2

All-gather
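In MPI this step maps onto MPI_Allgather: every process contributes its own particle data and receives everyone else's. A minimal sketch, assuming for brevity one double per particle position and an illustrative buffer of 64 entries:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double my_pos = (double) rank;    /* this task's particle position */
    double all_pos[64];               /* assumes p <= 64               */

    /* After the call, every process holds the positions of all particles. */
    MPI_Allgather(&my_pos, 1, MPI_DOUBLE,
                  all_pos, 1, MPI_DOUBLE, MPI_COMM_WORLD);

    printf("rank %d sees position of particle 0: %g\n", rank, all_pos[0]);
    MPI_Finalize();
    return 0;
}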

Page 128: Day 2

Complete Graph for All-gather

Page 129: Day 2

Hypercube for All-gather

Page 130: Day 2

Communication Time

p

pnp

p

np

i

)1(

log2

log

1

1-i

Hypercube

Complete graph

p

pnp

pnp

)1(

)1()/

)(1(

Page 131: Day 2

Adding Data Input

Page 132: Day 2

Scatter
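In MPI, distributing the input among processes is MPI_Scatter. A minimal sketch; the chunk size, buffer size, and values are illustrative:

#include <mpi.h>
#include <stdio.h>

#define CHUNK 2   /* illustrative: items handed to each process */

int main(int argc, char *argv[]) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int input[64];                 /* assumes p * CHUNK <= 64       */
    int mine[CHUNK];

    if (rank == 0)                 /* only the root reads the input */
        for (int i = 0; i < p * CHUNK; i++) input[i] = i + 1;

    /* Hand CHUNK items to every process, root included. */
    MPI_Scatter(input, CHUNK, MPI_INT,
                mine, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received %d..%d\n", rank, mine[0], mine[CHUNK - 1]);
    MPI_Finalize();
    return 0;
}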

Page 133: Day 2

Scatter in log p Steps

[Diagram: the root starts with items 1–8 and sends half of its remaining data at each step: 12345678 → 1234 | 5678 → 12 | 34 | 56 | 78]

Page 134: Day 2

Summary: Task/channel Model

• Parallel computation– Set of tasks– Interactions through channels

• Good designs– Maximize local computations– Minimize communications– Scale up

Page 135: Day 2

Summary: Design Steps

• Partition computation

• Agglomerate tasks

• Map tasks to processors

• Goals– Maximize processor utilization– Minimize inter-processor communication

Page 136: Day 2

Summary: Fundamental Algorithms

• Reduction

• Gather and scatter

• All-gather

Page 137: Day 2

High Throughput Computing

• Easy problems – formerly known as “embarrassingly parallel” – now known as “pleasingly parallel”

• Basic idea – “Gee – I have a whole bunch of jobs (single run of a program) that I need to do, why not run them concurrently rather than sequentially”

• Sometimes called “bag of tasks” or parameter sweep problems

Page 138: Day 2

Bag-of-tasks

Page 139: Day 2

Examples

• A large number of proteins – each represented by a different file – to "dock" with a target protein
– For all files x, execute f(x, y)
• Exploring a parameter space in n dimensions
– Uniform
– Non-uniform
• Monte Carlo simulations

Page 140: Day 2

Tools

• Most common tool is a queuing system – sometimes called a load management system, or a local resource manager

• PBS, LSF, and SGE are the three most common. Condor is also often used.

• They all have the same basic functions; we'll use PBS as an exemplar.

• Script languages (bash, Perl, etc.)

Page 141: Day 2

PBS

• qsub options script-file
– Submit the script to run
– Options can specify number of processors, other required resources (memory, etc.)
– Returns the job ID (a string)

Page 142: Day 2
Page 143: Day 2
Page 144: Day 2

Other PBS

• qstat – give the status of jobs submitted to the queue

• qdel – delete a job from the queue

Page 145: Day 2

Blasting a set of jobs

Page 146: Day 2

Issues

• Overhead per job is substantial
– Don't want to run millisecond jobs
– May need to "bundle them up"
• May not be enough jobs to saturate resources
– May need to break up jobs
• I/O system may become saturated
– Copy large files to /tmp, check for existence in your shell script, copy if not there
• May be more jobs than the queuing system can handle (many start to break down at several thousand jobs)
• Jobs may fail for no good reason
– Develop scripts to check for output and re-submit up to k jobs

Page 147: Day 2

Homework

1. Submit a simple job to the queue that echoes the host name; redirect output to a file of your choice.

2. Via a script, submit 100 "hostname" jobs to the queue. Output should be written to "output.X", where X is the output number.

3. For each file in a rooted directory tree, run "wc" to count the words. Maintain the results in a "shadow" directory tree. Your script should be able to detect results that have already been computed.