Lecture 9: Dense Matrices and Decomposition
CSCE 569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
http://cse.sc.edu/~yanyh
Review: Parallel Algorithm Design and Decomposition
• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
• Decomposition Techniques
  – Recursive Decomposition
  – Data Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition
• Characteristics of Tasks and Interactions
  – Task Generation, Granularity, and Context
  – Characteristics of Task Interactions
Decomposition, Tasks, and Dependency Graphs
• Decompose work into tasks that can be executed concurrently
• Decomposition could be done in many different ways
• Tasks may be of same, different, or even indeterminate sizes
• Task dependency graph:
  – node = task
  – edge = control dependence, output-input dependency
  – No dependency == parallelism
Degree of Concurrency
• Definition: the number of tasks that can be executed in parallel
• May change over program execution
• Metrics
  – Maximum degree of concurrency
    • Maximum number of concurrent tasks at any point during execution
  – Average degree of concurrency
    • The average number of tasks that can be processed in parallel over the execution of the program
• Speedup: serial_execution_time / parallel_execution_time
• Inverse relationship of degree of concurrency and task granularity
  – Task granularity up (fewer tasks) → degree of concurrency down
  – Task granularity down (more tasks) → degree of concurrency up
Critical Path Length
• A directed path: a sequence of tasks that must be serialized
  – Executed one after another
• Critical path:
  – The longest weighted path through the graph
• Critical path length: shortest time in which the program can be finished
  – Lower bound on parallel execution time
Critical Path Length and Degree of Concurrency
(Examples: a building project; database query task dependency graphs)
Questions:
• What are the tasks on the critical path for each dependency graph?
• What is the shortest parallel execution time?
• How many processors are needed to achieve the minimum time?
• What is the maximum degree of concurrency?
• What is the average parallelism (average degree of concurrency)?
  – Total amount of work / (critical path length) = 2.33 (63/27) and 1.88 (64/34)
Task Interaction Graphs, Granularity, and Communication
• Finer task granularity → more overhead of task interactions
  – Overhead as a ratio of the useful work of a task
• Example: sparse matrix-vector product interaction graph
• Assumptions:
  – each dot product (A[i][j] * b[j]) takes unit time to process
  – each communication (edge) causes an overhead of a unit time
• If node 0 is a task: communication = 3; computation = 4
• If nodes 0, 4, and 5 are a task: communication = 5; computation = 15
  – coarser-grain decomposition → smaller communication/computation ratio (3/4 vs. 5/15)
Processes and Mapping
A good mapping must minimize parallel execution time by:
• Mapping independent tasks to different processes
  – Maximize concurrency
• Tasks on the critical path have high priority of being assigned to processes
• Minimizing interaction between processes
  – mapping tasks with dense interactions to the same process
• Difficulty: these criteria often conflict with each other
  – E.g. no decomposition, i.e. one task, minimizes interaction but gives no speedup at all!
Recursive Decomposition: Min
procedure SERIAL_MIN (A, n)
  min := A[0];
  for i := 1 to n − 1 do
    if (A[i] < min) min := A[i];
  return min;

procedure RECURSIVE_MIN (A, n)
  if (n = 1) then
    min := A[0];
  else
    lmin := RECURSIVE_MIN (A, n/2);
    rmin := RECURSIVE_MIN (&(A[n/2]), n − n/2);
    if (lmin < rmin) then min := lmin;
    else min := rmin;
  return min;

Finding the minimum in a vector using divide-and-conquer.
Applicable to other associative operations, e.g. sum, AND …
Known as a reduction operation
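The pseudocode above translates directly to C; a minimal sketch (the two recursive halves are independent, so they could also be spawned as OpenMP tasks):

```c
#include <stddef.h>

/* Serial version: one pass over the vector, tracking the minimum. */
int serial_min(const int *A, size_t n) {
    int min = A[0];
    for (size_t i = 1; i < n; i++)
        if (A[i] < min)
            min = A[i];
    return min;
}

/* Divide-and-conquer version: split the vector in half, solve each
 * half recursively, and combine with one comparison. The two
 * recursive calls are independent, which is the source of
 * parallelism. */
int recursive_min(const int *A, size_t n) {
    if (n == 1)
        return A[0];
    int lmin = recursive_min(A, n / 2);
    int rmin = recursive_min(&A[n / 2], n - n / 2);
    return (lmin < rmin) ? lmin : rmin;
}
```

The same shape works for any associative reduction (sum, AND, …) by replacing the combine step.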
Data Decomposition -- the most commonly used approach
• Steps:
  1. Identify the data on which computations are performed
  2. Partition this data across various tasks
• Partitioning induces a decomposition of the problem, i.e. the computation is partitioned
• Data can be partitioned in various ways
  – Critical for parallel performance
• Decomposition based on
  – output data
  – input data
  – input + output data
  – intermediate data
Output Data Decomposition: Example
Count the frequency of itemsets in database transactions
• Decompose the itemsets to count
  – each task computes the total count for each of its itemsets
  – append the total counts for the itemsets to produce the total count result
Input Data Partitioning: Example
Dense matrix algorithms
• Dense linear algebra and BLAS
• Image processing / stencil
• Iterative methods
Motifs
The Motifs (formerly “Dwarfs”) from “The Berkeley View” (Asanovic et al.) form key computational patterns
The Landscape of Parallel Computing Research: A View from Berkeley
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
Dense linear algebra
• Software libraries solving linear systems
• BLAS (Basic Linear Algebra Subprograms)
  – Vector, matrix-vector, matrix-matrix
• Linear Systems: Ax = b
• Least Squares: choose x to minimize ||Ax − b||_2
  – Overdetermined or underdetermined
  – Unconstrained, constrained, weighted
• Eigenvalues and vectors of symmetric matrices
  – Standard (Ax = λx), Generalized (Ax = λBx)
• Eigenvalues and vectors of unsymmetric matrices
  – Eigenvalues, Schur form, eigenvectors, invariant subspaces
  – Standard, Generalized
• Singular values and vectors (SVD)
  – Standard, Generalized
• Different matrix structures
  – Real, complex; symmetric, Hermitian, positive definite; dense, triangular, banded …
• Level of detail
  – Simple Driver
  – Expert Drivers with error bounds, extra precision, other options
  – Lower-level routines (“apply certain kind of orthogonal transformation”, matmul …)
BLAS (Basic Linear Algebra Subprograms)
• BLAS 1, 1973-1977
  – 15 operations (mostly) on vectors (1-D arrays)
    • “AXPY” (y = α·x + y), dot product, scale (x = α·x)
  – Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC
  – Why BLAS 1? They do O(n) ops on O(n) data: AXPY
    • 2n flops on 3n reads/writes
    • Computational intensity = (2n)/(3n) = 2/3
BLAS 2
• BLAS 2, 1984-1986
  – 25 operations (mostly) on matrix/vector pairs
  – “GEMV”: y = α·A·x + β·y, “GER”: A = A + α·x·y^T, triangular solve: x = T^(-1)·x
  – Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC
• Why BLAS 2? They do O(n^2) ops on O(n^2) data
  – Computational intensity still just ~(2n^2)/(n^2) = 2
BLAS 3
• BLAS 3, 1987-1988
  – 9 operations (mostly) on matrix/matrix pairs
    • “GEMM”: C = α·A·B + β·C, C = α·A·A^T + β·C, B = T^(-1)·B
  – Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC
  – Why BLAS 3? They do O(n^3) ops on O(n^2) data
    • Computational intensity = (2n^3)/(4n^2) = n/2, big at last!
    • Good for machines with caches, deep memory hierarchy
A[M][K] * B[K][N] = C[M][N]
Decomposition for AXPY, Matrix-Vector, and Matrix Multiplication
BLAS 1: AXPY
• y = α·x + y
  – x and y are vectors of size N
    • In C: x[N], y[N]
  – α is a scalar
• Decomposition is simple
  – N iterations (N elements of x and y) are distributed among threads
  – 1:1 mapping between iteration and element of x and y
  – x and y are shared
(figure: elements of x and y distributed to threads T0, T1 in chunks of 3)
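A sketch of this decomposition in C with OpenMP; the pragma distributes the N independent iterations among threads (and is simply ignored when compiled without OpenMP, so the loop then runs serially):

```c
#include <stddef.h>

/* BLAS-1 AXPY: y = alpha*x + y. The N iterations are independent
 * and map 1:1 onto vector elements; OpenMP splits them among
 * threads while x and y stay shared. */
void axpy(size_t n, float alpha, const float *x, float *y) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```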
BLAS 2: Matrix-Vector Multiplication
• y = A·x
  – A[M][N], x[N], y[N]
• Row-wise decomposition
(figure: each thread computes Mt rows of y, starting at its row i_start)
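A row-wise sketch in C with OpenMP; the worksharing loop in effect gives each thread its own i_start and Mt rows. A is assumed to be stored row-major as a flat array:

```c
/* BLAS-2 GEMV (y = A*x) with row-wise decomposition: each thread
 * gets a contiguous block of rows of A and computes the matching
 * elements of y; x is shared and read-only. A is M x N, row-major. */
void matvec(int M, int N, const double *A, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < M; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i * N + j] * x[j];
        y[i] = sum;
    }
}
```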
BLAS 3: Dense Matrix Multiplication
A[M][K] * B[K][N] = C[M][N]
• Base
• Base_1: column-major order of access
• row1D_dist
• column1D_dist
• rowcol2D_dist
• Decomposition is to calculate Mt and Nt
BLAS 3: Dense Matrix Multiplication
• Row-based 1-D
(figure: A × B = C, with the rows of A and C split into blocks assigned to threads T0-T3)
BLAS 3: Dense Matrix Multiplication
• Column-based 1-D
(figure: A × B = C, with the columns of B and C split into blocks assigned to threads T0-T3)
BLAS 3: Dense Matrix Multiplication
• Row/Column-based 2-D
  – Needs nested parallelism: export OMP_NESTED=true
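A sketch of the decomposition in C; `collapse(2)` distributes the whole 2-D (i, j) iteration space of C, which approximates the row/column 2-D distribution with a single parallel region instead of nested parallelism (dropping `collapse(2)` gives the row-based 1-D version):

```c
/* Dense matmul C = A*B with A: M x K, B: K x N, C: M x N, all
 * row-major flat arrays. Each (i, j) element of C is independent,
 * so the two outer loops form the parallel iteration space. */
void matmul(int M, int K, int N,
            const double *A, const double *B, double *C) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < K; k++)
                sum += A[i * K + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}
```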
Dense matrix algorithms
• Dense linear algebra and BLAS
• Image processing / stencil
• Iterative methods
What is Multimedia
• Multimedia is a combination of text, graphics, sound, animation, and video that is delivered interactively to the user by electronic or digitally manipulated means.
  https://en.wikipedia.org/wiki/Multimedia
• Videos contain frames (images)
Image Format and Processing
• Pixels
  – Images are matrices of pixels
• Binary images
  – Each pixel is either 0 or 1
Image Format and Processing
• Pixels
  – Images are matrices of pixels
• Grayscale images
  – Each pixel value normally ranges from 0 (black) to 255 (white)
  – 8 bits per pixel
Image Format and Processing
• Pixels
  – Images are matrices of pixels
• Color images
  – Each pixel has three/four values (4 bits or 8 bits each), each representing a color scale
Histogram
• An image histogram is a graph of pixel intensity (on the x-axis) versus number of pixels (on the y-axis). The x-axis has all available gray levels, and the y-axis indicates the number of pixels that have a particular gray-level value.
https://www.allaboutcircuits.com/technical-articles/image-histogram-characteristics-machine-learning-image-processing/
Histograms of Monochrome Images
http://homepages.inf.ed.ac.uk/rbf/BOOKS/PHILLIPS/cips2edsrc/HIST.C
Histograms of Color Images
• Image density
https://docs.opencv.org/3.4.0/d3/dc1/tutorial_basic_linear_transform.html
OpenMP Parallelization of Histogram
• Decomposition based on the output (pixel values, 0-255)
  – Each thread searches the whole image to count only those pixels that have the values it is responsible for
    • E.g. with 4 threads: 0-63 for thread 0, 64-127 for thread 1, …
• Decomposition based on the input (image)
  – Each thread searches part of the image to count all the pixels, storing a partial histogram locally
  – Add up all the partial histograms
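The input decomposition can be sketched with OpenMP's array-section reduction (available since OpenMP 4.5; when compiled without OpenMP the pragma is ignored and the code runs serially), which gives each thread a private partial histogram and sums them at the end, exactly the pattern described above:

```c
#include <stddef.h>

/* Grayscale histogram with input decomposition: the image is split
 * among threads, each counts into a private copy of hist, and the
 * reduction combines the partial histograms. */
void histogram(const unsigned char *img, size_t n, long hist[256]) {
    for (int v = 0; v < 256; v++)
        hist[v] = 0;
    #pragma omp parallel for reduction(+ : hist[:256])
    for (size_t i = 0; i < n; i++)
        hist[img[i]]++;
}
```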
Image Filtering
• Changing pixel values by doing a convolution between a kernel (filter) and an image.

Image Filtering: the magic of the filter matrix
• http://lodev.org/cgtutor/filtering.html
• https://en.wikipedia.org/wiki/Kernel_(image_processing)
• It is the basis of convolutional neural networks
Convolutional Neural Networks for Object Detection
• Pooling: sample-based discretization process
http://cs231n.github.io/convolutional-networks/
OpenMP Parallelization of Image Filtering
• Decomposition according to the input image
• Since input and output images are separate, it is straightforward
  – Could be row 1-D, col 1-D, rowcol 2-D
• False sharing when writing the boundary of the output images
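A row-1D sketch of a 3×3 filter in C (the function name and the choice to skip the one-pixel border are illustrative, not the assignment's exact code). Each thread writes a disjoint block of rows of the output, so no synchronization is needed, though false sharing can still occur where thread blocks meet:

```c
/* 3x3 convolution with separate input and output images (row-major,
 * H x W). Border pixels are left untouched in this sketch. */
void filter3x3(int H, int W, const float *in, float *out,
               const float k[3][3]) {
    #pragma omp parallel for
    for (int i = 1; i < H - 1; i++)
        for (int j = 1; j < W - 1; j++) {
            float s = 0.0f;
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    s += k[di + 1][dj + 1] * in[(i + di) * W + (j + dj)];
            out[i * W + j] = s;
        }
}
```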
Dense matrix algorithms
• Dense linear algebra and BLAS
• Image processing / stencil
• Iterative methods
Iterative Methods
• Iterative methods can be expressed in the general form:
    x^(k) = F(x^(k-1))
  Hopefully: x^(k) → s (the solution of my problem)
• Wide variety of computational science problems
  – CFD, molecular dynamics, weather/climate forecast, cosmology, …
• Will it converge? How rapidly?
Iterative Stencil Applications
    x^(k) = F(x^(k-1))
Loop until some condition is true
  Perform computation which involves communicating with the N, E, W, S neighbors of a point (5-point stencil)
  [Convergence test?]
The stencil is similar to image filtering/convolution
Jacobi.c
• Assignments 2 and 3:
https://passlab.github.io/CSCE569/Assignment_2/jacobi.c
Jacobi
• An iterative method for approximating the solution to a system of linear equations.
• Ax = b, where the ith equation is

    a_{i,1} x_1 + a_{i,2} x_2 + … + a_{i,n} x_n = b_i

• The a's and b's are known; we want to solve for the x's. Solving the ith equation for x_i gives the Jacobi update:

    x_i = (1 / a_{i,i}) [ b_i − Σ_{j≠i} a_{i,j} x_j ]
OpenMP Parallelization of Jacobi
• Similar to image filtering
  – Enclosed by the while loop to be iterative
• omp parallel for the outer while loop
• omp for for the inner for loops
• single and reduction are needed
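One possible structure, sketched on the 5-point stencil of jacobi.c rather than the assignment's exact code (the names u, unew and the max-norm convergence test are assumptions of this sketch; the caller must initialize the boundary in both buffers, since only interior points are updated):

```c
#include <math.h>

/* Jacobi sweep on an n x n grid: the while loop stays serial, the
 * sweep over interior points is workshared, and a max reduction
 * accumulates the convergence error. Returns the final error. */
double jacobi(int n, double *u, double *unew, double tol, int max_iter) {
    double err = tol + 1.0;
    int iter = 0;
    while (err > tol && iter < max_iter) {
        err = 0.0;
        #pragma omp parallel for reduction(max : err)
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++) {
                unew[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j]
                                        + u[i * n + j - 1] + u[i * n + j + 1]);
                double d = fabs(unew[i * n + j] - u[i * n + j]);
                if (d > err) err = d;
            }
        /* swap buffers for the next iteration */
        double *tmp = u; u = unew; unew = tmp;
        iter++;
    }
    return err;
}
```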
Ghost Cell Exchange
• For Assignment 3:
Background: C multidimensional arrays
Vector/Matrix and Array in C
• C has row-major storage for multidimensional arrays
  – A[2][2] is followed by A[2][3]
• 3-dimensional array
  – B[3][100][100]
• Think of it as a recursive definition
  – A[4][10][32]
char A[4][4]
Column Major
• Fortran is column major
Array Layout: Why We Care?
1. Makes a big difference for access speed
• For performance, set up code to go in row-major order in C
  – Caching: each read from memory brings adjacent elements into the cache line
• (Bad) example: 4 vs. 16 accesses (matmul_base_1)
for i = 1 to n
  for j = 1 to n
    A[j][i] = value
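The two loop orders side by side; both store the same values, but the row-major order walks memory contiguously while the bad example above strides by a whole row per access:

```c
#define N 4

/* Good: the inner loop varies the rightmost index, so successive
 * writes touch adjacent memory and use each cache line fully. */
void fill_row_major(int A[N][N], int value) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = value;
}

/* Bad: swapping the indices makes each write jump a full row
 * (N ints) ahead, wasting most of every cache line fetched. */
void fill_col_major(int A[N][N], int value) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[j][i] = value;
}
```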
Array Layout: Why We Care?
2. Affects decomposition and data movement
• Decomposition may create submatrices that are in non-contiguous memory locations, e.g. A3 and B1
• Submatrices in contiguous memory locations of a 2-D row-major matrix:
  – a single-row submatrix, e.g. A2
  – a submatrix formed from adjacent rows with full column length, e.g. A1
(figure: submatrices A1, A2, A3, and B1 marked inside 2-D matrices)
Array Layout: Why We Care?
2. Affects decomposition and submatrices
• Row- or column-wise distribution of a 2-D row-major array
• Number of data movements to exchange data between T0 and T1:
  – row-wise: one memory copy by each
  – column-wise: 16 copies each
(figure: row-wise vs. column-wise distribution of the array among threads T0-T3)
Arrays and Pointers in C
• In C, an array is a pointer + dimensionality
  – They are literally the same in binary, i.e. a pointer to the first element, referenced as the base address
• Cast and assignment from array to pointer, e.g. int A[M][N]:
  – A, &A[0][0], and A[0] have the same value, i.e. the pointer to the first element of the array
• Cast a pointer to an array:
  – int *ap; int (*A)[N] = (int (*)[N])ap; A[i][j] …
• Address calculation for array references:
  – address of A[i][j] = A + (i*N + j)*sizeof(int)
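The identities above are executable; a small check (`flat_index` and `demo` are illustrative helpers, not library functions):

```c
#include <assert.h>

#define M 3
#define N 4

/* Flat-index access: address of A[i][j] is base + (i*N + j) elements. */
int flat_index(int *base, int i, int j) {
    return base[i * N + j];
}

void demo(void) {
    int A[M][N];
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = i * N + j;

    /* A, &A[0][0], and A[0] all point at the first element. */
    assert((void *)A == (void *)&A[0][0]);
    assert((void *)A[0] == (void *)&A[0][0]);

    /* Cast a flat pointer back to a 2-D array type. */
    int *ap = &A[0][0];
    int (*B)[N] = (int (*)[N])ap;
    assert(B[2][3] == A[2][3]);
    assert(flat_index(ap, 1, 2) == A[1][2]);
}
```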