Lecture 8: Principles of Parallel Algorithm Design
CSCE 569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan, [email protected], http://cse.sc.edu/~yanyh

Transcript
Page 1:
Lecture 8: Principles of Parallel Algorithm Design

CSCE 569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
http://cse.sc.edu/~yanyh

Page 2:
Topics

• Introduction
• Programming on shared memory system (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools

Page 3:
Topics

• Programming on shared memory system (Chapter 7)
  – Cilk/Cilkplus and OpenMP Tasking
  – PThread, mutual exclusion, locks, synchronizations
• Parallel architectures and hardware
  – Parallel computer architectures
  – Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
  – GPU architectures
  – CUDA programming
  – Introduction to offloading model in OpenMP

Page 4:
“parallel” and “for” OpenMP Constructs

Sequential code:

    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:

    #pragma omp parallel shared(a, b)
    {
        int id, i, Nthrds, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend = (id + 1) * N / Nthrds;
        for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
    }

OpenMP parallel region and a worksharing for construct:

    #pragma omp parallel shared(a, b) private(i)
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

Page 5:
OpenMP Best Practices

Use nowait to avoid redundant barriers; one explicit barrier is still needed before the loop that consumes the a[] and c[] produced by the first two loops:

    #pragma omp parallel private(i)
    {
        #pragma omp for nowait
        for (i = 0; i < n; i++)
            a[i] += b[i];

        #pragma omp for nowait
        for (i = 0; i < n; i++)
            c[i] += d[i];

        #pragma omp barrier

        #pragma omp for nowait reduction(+:sum)
        for (i = 0; i < n; i++)
            sum += a[i] + c[i];
    }

Page 6:
False Sharing in OpenMP and Solution

• False sharing
  – When at least one thread writes to a cache line while others access it
    • Thread 0: = A[1] (read)
    • Thread 1: A[0] = … (write)
• Solution: use array padding

Without padding (one int per thread, all in the same cache line):

    int a[max_threads];
    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < max_threads; i++)
        a[i] += i;

With padding (each thread's element on its own cache line):

    int a[max_threads][cache_line_size];
    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < max_threads; i++)
        a[i][0] += i;

A store into a shared cache line invalidates the other copies of that line: the system is not able to distinguish between changes within one individual line.
[Figure: two CPUs with private caches accessing array A in shared memory; threads T0 and T1 hit the same line]

(Source: “Getting OpenMP Up To Speed”, RvdP/V1, Tutorial IWOMP 2010, CCS Univ. of Tsukuba, June 14, 2010)

Page 7:
NUMA First-touch

    #pragma omp parallel for num_threads(2)
    for (i = 0; i < 100; i++)
        a[i] = 0;

First touch: both memories each have “their half” of the array, a[0]:a[49] placed near the first thread and a[50]:a[99] near the second.

(Source: “Getting OpenMP Up To Speed”, RvdP/V1, Tutorial IWOMP 2010, CCS Univ. of Tsukuba, June 14, 2010)
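A minimal first-touch sketch (illustrative, not from the slide; the array names and size are made up): initialize and compute with the same schedule(static) so each thread touches, and later reuses, the same pages.

    #include <omp.h>
    #define N 10000000
    static double a[N], b[N];

    int main(void) {
        int i;
        /* First touch: pages of a[] and b[] are placed near the thread
           that first writes them. */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = (double)i; }

        /* Same schedule(static) => same thread-to-iteration mapping,
           so each thread computes on memory local to its socket. */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++) a[i] += b[i];
        return 0;
    }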

Page 8:
SPMD Program Models in OpenMP

• SPMD (Single Program, Multiple Data) for parallel regions
  – All threads of the parallel region execute the same code
  – Each thread has a unique ID
• Use the thread ID to diverge the execution of the threads
  – Different threads can follow different paths through the same code
• SPMD is by far the most commonly used pattern for structuring parallel programs
  – MPI, OpenMP, CUDA, etc.

    if (my_id == x) { ... } else { ... }
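A minimal SPMD sketch (illustrative): all threads execute the same region, and the thread ID selects the path, here thread 0 as a coordinator and the rest as workers.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            if (id == 0)    /* coordinator path */
                printf("coordinator: %d threads\n", omp_get_num_threads());
            else            /* worker path through the same code */
                printf("worker %d doing its share\n", id);
        }
        return 0;
    }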

Page 9:
Overview: Algorithms and Concurrency (Part 1)

• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
• Decomposition Techniques
  – Recursive Decomposition
  – Data Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition
• Characteristics of Tasks and Interactions
  – Task Generation, Granularity, and Context
  – Characteristics of Task Interactions

Page 10:
Overview: Concurrency and Mapping (Part 2)

• Mapping Techniques for Load Balancing
  – Static and Dynamic Mapping
• Methods for Minimizing Interaction Overheads
  – Maximizing Data Locality
  – Minimizing Contention and Hot Spots
  – Overlapping Communication and Computations
  – Replication vs. Communication
  – Group Communications vs. Point-to-Point Communication
• Parallel Algorithm Design Models
  – Data-Parallel, Work-Pool, Task Graph, Master-Slave, Pipeline, and Hybrid Models

Page 11:
Decomposition, Tasks, and Dependency Graphs

• Decompose work into tasks that can be executed concurrently
• Decomposition could be done in many different ways
• Tasks may be of same, different, or even indeterminate sizes
• Task dependency graph:
  – node = task
  – edge = control dependence, output-input dependency
  – No dependency == parallelism

Page 12:
Example: Dense Matrix-Vector Multiplication

[Figure: y = A × b, one task per element of y]

• Computation of each element of output vector y is independent
• Decomposed into n tasks, one per element in y → easy
• Observations
  – Each task only reads one row of A and writes one element of y
  – All tasks share vector b (the shared data)
  – No control dependencies between tasks
  – All tasks are of the same size in terms of number of operations

    REAL A[n][n], b[n], y[n], sum;
    int i, j;
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (j = 0; j < n; j++)
            sum += A[i][j] * b[j];
        y[i] = sum;   /* was c[i]; y is the declared output vector */
    }
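For reference, a sketch of this decomposition in OpenMP (assuming the declarations above; one loop iteration corresponds to one task):

    #pragma omp parallel for shared(A, b, y)
    for (i = 0; i < n; i++) {
        REAL sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i][j] * b[j];
        y[i] = sum;    /* each task writes exactly one element of y */
    }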

Page 13:
Example: Database Query Processing

Consider the execution of the query:

    MODEL = “CIVIC” AND YEAR = 2001 AND (COLOR = “GREEN” OR COLOR = “WHITE”)

on the following table:

    ID#   Model    Year  Color  Dealer  Price
    4523  Civic    2002  Blue   MN      $18,000
    3476  Corolla  1999  White  IL      $15,000
    7623  Camry    2001  Green  NY      $21,000
    9834  Prius    2001  Green  CA      $18,000
    6734  Civic    2001  White  OR      $17,000
    5342  Altima   2001  Green  FL      $19,000
    3845  Maxima   2001  Blue   NY      $22,000
    8354  Accord   2000  Green  VT      $18,000
    4395  Civic    2001  Red    CA      $17,000
    7352  Civic    2002  Red    WA      $18,000

Page 14:
Example: Database Query Processing

• Tasks: each task searches the whole table for entries that satisfy one predicate
  – Results: a list of entries
• Edges: output of one task serves as input to the next

    MODEL = “CIVIC” AND YEAR = 2001 AND (COLOR = “GREEN” OR COLOR = “WHITE”)

Page 15:
Example: Database Query Processing

• An alternate decomposition:

    MODEL = “CIVIC” AND YEAR = 2001 AND (COLOR = “GREEN” OR COLOR = “WHITE”)

• Different decompositions may yield different parallelism and performance

Page 16:
Granularity of Task Decompositions

• Granularity is task size (amount of computation)
  – Depends on the number of tasks for the same problem size
• Fine-grained decomposition = large number of tasks
• Coarse-grained decomposition = small number of tasks
• Granularity for dense matrix-vector product
  – fine-grain: each task computes an individual element in y
  – coarser-grain: each task computes 3 elements in y

[Figure: y = A × b with rows grouped three per task]

Page 17:
Degree of Concurrency

• Definition: the number of tasks that can be executed in parallel
• May change over program execution
• Metrics
  – Maximum degree of concurrency
    • Maximum number of concurrent tasks at any point during execution
  – Average degree of concurrency
    • The average number of tasks that can be processed in parallel over the execution of the program
• Speedup: serial_execution_time / parallel_execution_time
• Inverse relationship between degree of concurrency and task granularity
  – Task granularity ↑ (fewer tasks), degree of concurrency ↓
  – Task granularity ↓ (more tasks), degree of concurrency ↑

Page 18:
Examples: Degree of Concurrency

• Maximum degree of concurrency
• Average degree of concurrency

[Figure: y = A × b task graph and database query task dependency graph]

• Database query
  – Max: 4
  – Average: 7/3 (assuming each task takes the same time)
• Matrix-vector multiplication
  – Max: n
  – Average: n

Page 19:
Critical Path Length

• A directed path: a sequence of tasks that must be serialized
  – Executed one after another
• Critical path:
  – The longest weighted path through the graph
• Critical path length: shortest time in which the program can be finished
  – Lower bound on parallel execution time

[Figure: task dependency graph for a building project]

Page 20:
Critical Path Length and Degree of Concurrency

Database query task dependency graphs.

Questions:
• What are the tasks on the critical path for each dependency graph?
• What is the shortest parallel execution time?
• How many processors are needed to achieve the minimum time?
• What is the maximum degree of concurrency?
• What is the average parallelism (average degree of concurrency)?
  – Total amount of work / critical path length: 2.33 (63/27) and 1.88 (64/34) for the two graphs

Page 21:
Limits on Parallel Performance

• What bounds parallel execution time?
  – minimum task granularity
    • e.g., dense matrix-vector multiplication has ≤ n² concurrent tasks
  – fraction of application work that can't be parallelized
    • more about Amdahl's law in a later lecture…
  – dependencies between tasks
  – parallelization overheads
    • e.g., cost of communication between tasks
• Metrics of parallel performance
  – T1: sequential execution time; Tp: parallel execution time on p processors/cores/threads
  – speedup = T1 / Tp
  – parallel efficiency = T1 / (p × Tp)
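A quick worked example with illustrative numbers: if T1 = 100s and Tp = 30s on p = 4 cores, then speedup = 100/30 ≈ 3.3 and parallel efficiency = 100/(4 × 30) ≈ 0.83, i.e., each core does useful work about 83% of the time.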

Page 22:
Task Interaction Graphs

• Tasks generally exchange data with others
  – example: dense matrix-vector multiply
    • If b is not replicated in all tasks: each task has some, but not all, of b
    • Tasks will have to communicate elements of b
• Task interaction graph
  – node = task
  – edge = interaction or data exchange
• Task interaction graphs vs. task dependency graphs
  – Task dependency graphs represent control dependences
  – Task interaction graphs represent data dependences

Page 23:
Task Interaction Graphs: An Example

Sparse matrix-vector multiplication: y = A × b
• Computation of each result element = an independent task
• Only non-zero elements of sparse matrix A participate in the computation
• If b is partitioned across tasks, i.e., task Ti has only b[i]:
  – The task interaction graph of the computation = the graph of the matrix A
  – A is the adjacency matrix of the graph

Page 24:
Task Interaction Graphs, Granularity, and Communication

• Finer task granularity → more overhead of task interactions
  – Overhead as a ratio of useful work of a task
• Example: sparse matrix-vector product interaction graph
• Assumptions:
  – each dot (A[i][j] * b[j]) takes unit time to process
  – each communication (edge) causes an overhead of a unit time
• If node 0 is a task: communication = 3; computation = 4
• If nodes 0, 4, and 5 are a task: communication = 5; computation = 15
  – coarser-grain decomposition → smaller communication/computation ratio (5/15 = 0.33 vs. 3/4 = 0.75)

Page 25:
Processes and Mapping

• Generally
  – # of tasks >= # of processing elements (PEs) available
  – a parallel algorithm must map tasks to processes
• Mapping: aggregate tasks into processes
  – Process/thread = processing or computing agent that performs work
  – assign a collection of tasks and associated data to a process
• A PE, e.g., a core/thread, has its own system abstraction
  – Not easy to bind tasks to physical PEs; thus there are more layers, at least conceptually, from PE to task mapping
  – Process in MPI, thread in OpenMP/pthread, etc.
• The overloaded terms of processes and threads
  – Task → processes → OS processes → CPU → cores
  – For the sake of simplicity, processes = PEs/cores

Page 26:
Processes and Mapping

• Mapping of tasks to processes is critical to the parallel performance of an algorithm
• On what basis should one choose mappings?
  – Task dependency graph
  – Task interaction graph
• Task dependency graphs
  – To ensure equal spread of work across all processes at any point
    • minimum idling
    • optimal load balance
• Task interaction graphs
  – To minimize interactions
    • Minimize communication and synchronization

Page 27:
Processes and Mapping

A good mapping must minimize parallel execution time by:
• Mapping independent tasks to different processes
  – Maximize concurrency
• Giving tasks on the critical path high priority of being assigned to processes
• Minimizing interaction between processes
  – mapping tasks with dense interactions to the same process
• Difficulty: these criteria often conflict with each other
  – E.g., no decomposition, i.e., one task, minimizes interaction but gives no speedup at all!

Page 28:
Processes and Mapping: Example

Example: mapping database queries to processes
• Consider the dependency graph in levels
  – no nodes in a level depend upon one another
• Assign all tasks within a level to different processes

Page 29:
Overview: Algorithms and Concurrency

• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
• Decomposition Techniques
  – Recursive Decomposition
  – Data Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition
• Characteristics of Tasks and Interactions
  – Task Generation, Granularity, and Context
  – Characteristics of Task Interactions

Page 30:
Decomposition Techniques

So how does one decompose a task into various subtasks?
• No single recipe that works for all problems
• Practically used techniques
  – Recursive decomposition
  – Data decomposition
  – Exploratory decomposition
  – Speculative decomposition

Page 31:
Recursive Decomposition

Generally suited to problems solvable using the divide-and-conquer strategy.
Steps:
1. decompose a problem into a set of sub-problems
2. recursively decompose each sub-problem
3. stop decomposition when the minimum desired granularity is reached

Page 32:
Recursive Decomposition: Quicksort

At each level and for each vector:
1. Select a pivot
2. Partition the set around the pivot
3. Recursively sort each subvector

    quicksort(A, lo, hi)
        if lo < hi
            p = pivot_partition(A, lo, hi)
            quicksort(A, lo, p - 1)
            quicksort(A, p + 1, hi)

Each vector can be sorted concurrently (i.e., each sorting represents an independent subtask).
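A task-parallel sketch of this decomposition with OpenMP tasking (illustrative; pivot_partition is assumed to be a serial partition helper, and real code would fall back to a serial sort below some cutoff size):

    void quicksort(int A[], int lo, int hi) {
        if (lo < hi) {
            int p = pivot_partition(A, lo, hi);  /* assumed serial helper */
            #pragma omp task shared(A)           /* left subvector as a task */
            quicksort(A, lo, p - 1);
            #pragma omp task shared(A)           /* right subvector as a task */
            quicksort(A, p + 1, hi);
            #pragma omp taskwait                 /* join before returning */
        }
    }

    /* launched once inside a parallel region:
       #pragma omp parallel
       #pragma omp single
       quicksort(A, 0, n - 1);                   */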

Page 33:
Recursive Decomposition: Min

Finding the minimum in a vector using divide-and-conquer:

    procedure SERIAL_MIN(A, n)
        min = A[0];
        for i := 1 to n − 1 do
            if (A[i] < min) min := A[i];
        return min;

    procedure RECURSIVE_MIN(A, n)
        if (n = 1) then
            min := A[0];
        else
            lmin := RECURSIVE_MIN(A, n/2);
            rmin := RECURSIVE_MIN(&(A[n/2]), n - n/2);
            if (lmin < rmin) then min := lmin;
            else min := rmin;
        return min;

Applicable to other associative operations, e.g., sum, AND…
Known as a reduction operation.
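In OpenMP (3.1 and later) this reduction pattern is available directly as a clause; a minimal sketch for the minimum, assuming a double array A of length n:

    double minval = A[0];
    #pragma omp parallel for reduction(min:minval)
    for (int i = 1; i < n; i++)
        if (A[i] < minval) minval = A[i];   /* partial minima combined by the runtime */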

Page 34:
Recursive Decomposition: Min

Finding the minimum in the set {4, 9, 1, 7, 8, 11, 2, 12}.

Task dependency graph:
• The RECURSIVE_MIN calls form a binary tree
• Each min task completes after its two sub-tasks finish
• Tasks in the same level can run in parallel

Page 35:
Fib with OpenMP Tasking

• Task completion occurs when the task reaches the end of the task region code
• Multiple tasks are joined to complete through the use of task synchronization constructs
  – taskwait
  – barrier construct
• taskwait constructs:
  – #pragma omp taskwait
  – !$omp taskwait

    int fib(int n) {
        int x, y;
        if (n < 2) return n;
        else {
            #pragma omp task shared(x)
            x = fib(n - 1);
            #pragma omp task shared(y)
            y = fib(n - 2);
            #pragma omp taskwait
            return x + y;
        }
    }
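The tasks only run in parallel when fib is called from inside a parallel region; a minimal launcher sketch (the argument 30 is arbitrary):

    #include <stdio.h>

    int main(void) {
        int result;
        #pragma omp parallel      /* create the thread team */
        #pragma omp single        /* one thread spawns the root of the task tree */
        result = fib(30);
        printf("fib(30) = %d\n", result);
        return 0;
    }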

Page 36:
Data Decomposition: the most commonly used approach

• Steps:
  1. Identify the data on which computations are performed
  2. Partition this data across various tasks
• Partitioning induces a decomposition of the problem, i.e., the computation is partitioned
• Data can be partitioned in various ways
  – Critical for parallel performance
• Decomposition based on
  – output data
  – input data
  – input + output data
  – intermediate data

Page 37:
Output Data Decomposition

• Each element of the output can be computed independently of others
  – simply as a function of the input
• A natural problem decomposition

Page 38:
Output Data Decomposition: Matrix Multiplication

Multiplying two n × n matrices A and B to yield matrix C.

The output matrix C can be partitioned into four tasks, one per block of C (the block formulas follow the standard 2 × 2 block product):

    Task 1: C1,1 = A1,1 B1,1 + A1,2 B2,1
    Task 2: C1,2 = A1,1 B1,2 + A1,2 B2,2
    Task 3: C2,1 = A2,1 B1,1 + A2,2 B2,1
    Task 4: C2,2 = A2,1 B1,2 + A2,2 B2,2

Page 39:
Output Data Decomposition: Example

A partitioning of output data does not result in a unique decomposition into tasks. For example, for the same problem as in the previous foil, with identical output data distribution, we can derive the following two (other) decompositions:

Decomposition I:
    Task 1: C1,1 = A1,1 B1,1
    Task 2: C1,1 = C1,1 + A1,2 B2,1
    Task 3: C1,2 = A1,1 B1,2
    Task 4: C1,2 = C1,2 + A1,2 B2,2
    Task 5: C2,1 = A2,1 B1,1
    Task 6: C2,1 = C2,1 + A2,2 B2,1
    Task 7: C2,2 = A2,1 B1,2
    Task 8: C2,2 = C2,2 + A2,2 B2,2

Decomposition II:
    Task 1: C1,1 = A1,1 B1,1
    Task 2: C1,1 = C1,1 + A1,2 B2,1
    Task 3: C1,2 = A1,2 B2,2
    Task 4: C1,2 = C1,2 + A1,1 B1,2
    Task 5: C2,1 = A2,2 B2,1
    Task 6: C2,1 = C2,1 + A2,1 B1,1
    Task 7: C2,2 = A2,1 B1,2
    Task 8: C2,2 = C2,2 + A2,2 B2,2

Page 40:
Output Data Decomposition: Example

Count the frequency of itemsets in database transactions.

• Decompose the itemsets to count
  – each task computes the total count for each of its itemsets
  – append the total counts for the itemsets to produce the total count result

Page 41:
Output Data Decomposition: Example

From the previous example, the following observations can be made:

• If the database of transactions is replicated across the processes, each task can be independently accomplished with no communication.
• If the database is partitioned across processes as well (for reasons of memory utilization), each task first computes partial counts. These counts are then aggregated at the appropriate task.

Page 42:
Input Data Partitioning

• Generally applicable if each output can be naturally computed as a function of the input
• In many cases, this is the only natural decomposition if the output is not clearly known a priori
  – e.g., the problem of finding the minimum, sorting
• A task is associated with each input data partition
  – The task performs computation with its part of the data
  – Subsequent processing combines these partial results
• MapReduce follows this pattern

Page 43:
Input Data Partitioning: Example

[Figure: itemset counting with the transaction database partitioned across tasks]

Page 44:
Partitioning Input and Output Data

• Partition on both input and output for more concurrency
• Example: itemset counting

Page 45:
Histogram

• Parallelizing histogram using OpenMP in Assignment 2
  – Similar to counting frequency
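A minimal OpenMP histogram sketch in that spirit (illustrative only, not the assignment's required solution; the bin count and types are made up): each thread counts into a private histogram, and the partial counts are combined at the end.

    #define NBINS 256

    void histogram(const unsigned char *data, long n, long hist[NBINS]) {
        for (int b = 0; b < NBINS; b++) hist[b] = 0;
        #pragma omp parallel
        {
            long local[NBINS] = {0};       /* per-thread partial counts */
            #pragma omp for nowait
            for (long i = 0; i < n; i++)
                local[data[i]]++;
            #pragma omp critical           /* combine partial results */
            for (int b = 0; b < NBINS; b++)
                hist[b] += local[b];
        }
    }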

Page 46:
Intermediate Data Partitioning

• If the computation is a sequence of transforms from input data to output data, e.g., an image processing workflow
• Can decompose based on data for intermediate stages

[Figure: four tasks, one per partition of the intermediate-stage data]

Page 47:
Intermediate Data Partitioning: Example

• Dense matrix multiplication
  – visualize this computation in terms of intermediate matrices D

Page 48:
Intermediate Data Partitioning: Example

• A decomposition of intermediate data: 8 + 4 tasks:

Stage I:
    Task 01: D1,1,1 = A1,1 B1,1    Task 02: D2,1,1 = A1,2 B2,1
    Task 03: D1,1,2 = A1,1 B1,2    Task 04: D2,1,2 = A1,2 B2,2
    Task 05: D1,2,1 = A2,1 B1,1    Task 06: D2,2,1 = A2,2 B2,1
    Task 07: D1,2,2 = A2,1 B1,2    Task 08: D2,2,2 = A2,2 B2,2

Stage II:
    Task 09: C1,1 = D1,1,1 + D2,1,1    Task 10: C1,2 = D1,1,2 + D2,1,2
    Task 11: C2,1 = D1,2,1 + D2,2,1    Task 12: C2,2 = D1,2,2 + D2,2,2

Page 49:
Intermediate Data Partitioning: Example

The task dependency graph for the decomposition into 12 tasks (shown in the previous foil) is as follows:

[Figure: task dependency graph; Stage I tasks 1–8 feed Stage II tasks 9–12]

Page 50:
The Owner-Computes Rule

• Each datum is assigned to a process
• Each process computes values associated with its data
• Implications
  – input data decomposition
    • all computations using an input datum are performed by its process
  – output data decomposition
    • an output is computed by the process assigned to the output data

Page 51:
References

• Adapted from the slides “Principles of Parallel Algorithm Design” by Ananth Grama.
• Based on Chapter 3 of “Introduction to Parallel Computing” by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003.