- At least 10x more primitive tasks than processors in target computer
- Minimize redundant computations and redundant data storage
- Primitive tasks roughly the same size
- Number of tasks an increasing function of problem size
- Grouping tasks into larger tasks
- Goals:
  - Improve performance
  - Maintain scalability of program
  - Simplify programming
- In MPI programming, goal often to create one agglomerated task per processor
- Locality of parallel algorithm has increased
- Replicated computations take less time than communications they replace
- Data replication doesn't affect scalability
- Agglomerated tasks have similar computational and communications costs
- Number of tasks increases with problem size
- Number of tasks suitable for likely target systems
- Tradeoff between agglomeration and code modification costs is reasonable
- Process of assigning tasks to processors
- Centralized multiprocessor: mapping done by operating system
- Distributed memory system: mapping done by user
- Conflicting goals of mapping:
  - Maximize processor utilization
  - Minimize interprocessor communication
- Static number of tasks
  - Structured communication
    - Constant computation time per task:
      - Agglomerate tasks to minimize comm
      - Create one task per processor
    - Variable computation time per task:
      - Cyclically map tasks to processors
  - Unstructured communication:
    - Use a static load balancing algorithm
- Considered designs based on one task per processor and multiple tasks per processor
- Evaluated static and dynamic task allocation
- If dynamic task allocation chosen, task allocator is not a bottleneck to performance
- If static task allocation chosen, ratio of tasks to processors is at least 10:1
- Boundary value problem
- Finding the maximum
- The n-body problem
- Adding data input