CS 61C: Great Ideas in Computer Architecture
Flynn Taxonomy, Data-level Parallelism
Instructors: Vladimir Stojanovic & Nicholas Weaver
http://inst.eecs.berkeley.edu/~cs61c/

New-School Machine Structures (It's a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search "Katz"
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words
• Hardware descriptions: all gates @ one time
• Programming Languages
[Figure: software/hardware stack, "Harness Parallelism & Achieve High Performance". Levels shown: Warehouse Scale Computer and Smart Phone; Computer (Core ... Core, Memory (Cache), Input/Output); Core (Instruction Unit(s), Functional Unit(s) computing A3+B3, A2+B2, A1+B1, A0+B0, Cache Memory); Logic Gates. A "Today's Lecture" marker highlights part of the stack.]

Using Parallelism for Performance
• Two basic ways:
  – Multiprogramming
    • run multiple independent programs in parallel
    • "Easy"
  – Parallel computing
    • run one program faster
    • "Hard"
• We'll focus on parallel computing for the next few lectures

Single-Instruction/Single-Data Stream (SISD)
• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines.
[Figure: Processing Unit]

Single-Instruction/Multiple-Data Stream (SIMD or "sim-dee")
• SIMD computer applies a single instruction stream to multiple data streams, for operations that may be naturally parallelized, e.g., Intel SIMD instruction extensions or NVIDIA Graphics Processing Unit (GPU)

Multiple-Instruction/Multiple-Data Streams (MIMD or "mim-dee")
• Multiple autonomous processors simultaneously executing different instructions on different data.
  – MIMD architectures include multicore and Warehouse-Scale Computers
[Figure: Instruction Pool and Data Pool connected to four processing units (PU)]

Multiple-Instruction/Single-Data Stream (MISD)
• Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream.
  – Rare, mainly of historical interest only

Flynn* Taxonomy, 1966
• In 2013, SIMD and MIMD most common parallelism in architectures, usually both in the same system!
• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
  – Single program that runs on all processors of a MIMD
  – Cross-processor execution coordination using synchronization primitives
• SIMD (aka hw-level data parallelism): specialized function units, for handling lock-step calculations involving arrays
  – Scientific computing, signal processing, multimedia (audio/video processing)
*Prof. Michael Flynn, Stanford

SIMD Architectures
• Data parallelism: executing same operation on multiple data streams
• Example to provide context:
  – Multiplying a coefficient vector by a data vector (e.g., in filtering)
    $y[i] := c[i] \times x[i], \quad 0 \le i < n$
• Sources of performance improvement:
  – One instruction is fetched & decoded for entire operation
  – Multiplications are known to be independent
  – Pipelining/concurrency in memory access as well
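
The coefficient-times-data example can be written as a plain scalar loop, which is the starting point that SIMD hardware speeds up. This is only a sketch: the function name `vec_mul` and the choice of `float` elements are assumptions made for illustration, not something the slides specify.

```c
#include <stddef.h>

/* Scalar reference for y[i] = c[i] * x[i], 0 <= i < n.
 * Hypothetical helper name; element type float is an assumption.
 * Every iteration is independent of the others, which is the
 * property that lets one SIMD instruction handle several i at once. */
void vec_mul(float *y, const float *c, const float *x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        y[i] = c[i] * x[i];
    }
}
```

Because no iteration reads a value written by another, the n multiplications can be grouped in any way the hardware finds convenient.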

Intel "Advanced Digital Media Boost"
• To improve performance, Intel's SIMD instructions
  – Fetch one instruction, do the work of multiple instructions

First SIMD Extensions: MIT Lincoln Labs TX-2, 1957

Intel SIMD Extensions
• MMX: 64-bit registers, reusing floating-point registers [1997]
• SSE (later extended by SSE2/3/4): new 128-bit registers [1999]
• AVX: new 256-bit registers [2011]
  – Space for expansion to 1024-bit registers
• AVX-512 [2013]

XMM Registers
• Architecture extended with eight 128-bit data registers: XMM registers
  – x86 64-bit address architecture adds 8 additional registers (XMM8 – XMM15)
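
Intel Architecture SSE2+ 128-Bit SIMD Data Types
[Figure: one 128-bit register divided four ways]
• 16 / 128 bits: sixteen packed bytes (8 bits each)
• 8 / 128 bits: eight packed words (16 bits each)
• 4 / 128 bits: four packed doublewords (32 bits each)
• 2 / 128 bits: two packed quadwords (64 bits each)
• Note: in Intel Architecture (unlike MIPS) a word is 16 bits
  – Single-precision FP: Doubleword (32 bits)
  – Double-precision FP: Quadword (64 bits)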
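
The element counts above are just the register width divided by the element width. A minimal sketch of that arithmetic follows; the helper name `simd_lanes` is hypothetical and exists only for illustration.

```c
/* Number of elements of element_bits each that fit in one register
 * of register_bits. Hypothetical helper, not part of any Intel API. */
unsigned simd_lanes(unsigned register_bits, unsigned element_bits) {
    return register_bits / element_bits;
}
```

For a 128-bit XMM register this gives 16 bytes, 8 words, 4 doublewords (or single-precision floats), and 2 quadwords (or doubles).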

SSE/SSE2 Floating Point Instructions
Note: Move does both load and store
• xmm: one operand is a 128-bit SSE2 register
• mem/xmm: other operand is in memory or an SSE2 register
• {SS} Scalar Single precision FP: one 32-bit operand in a 128-bit register
• {PS} Packed Single precision FP: four 32-bit operands in a 128-bit register
• {SD} Scalar Double precision FP: one 64-bit operand in a 128-bit register
• {PD} Packed Double precision FP: two 64-bit operands in a 128-bit register
• {A} means the 128-bit operand is aligned in memory
• {U} means the 128-bit operand is unaligned in memory
• {H} means move the high half of the 128-bit operand
• {L} means move the low half of the 128-bit operand

Packed and Scalar Double-Precision Floating-Point Operations
[Figure: Packed operation vs. Scalar operation]

Example: SIMD Array Processing
    for each f in array
        f = sqrt(f)
Scalar style:
    for each f in array {
        load f to the floating-point register
        calculate the square root
        write the result from the register to memory
    }
SIMD style:
    for each 4 members in array {
        load 4 members to the SSE register
        calculate 4 square roots in one operation
        store the 4 results from the register to memory
    }
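
The SIMD-style pseudocode maps fairly directly onto SSE intrinsics. The sketch below assumes an x86 compiler that provides `<xmmintrin.h>`; the function name `sqrt_array` is hypothetical, and unaligned loads and stores are used so the caller does not have to guarantee 16-byte alignment.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_sqrt_ps, ... */

/* In-place square root of n floats, four at a time.
 * Hypothetical helper name; assumes all inputs are non-negative. */
void sqrt_array(float *a, size_t n) {
    size_t i = 0;
    /* Main loop: load 4 members, take 4 square roots, store 4 results. */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(a + i);
        v = _mm_sqrt_ps(v);
        _mm_storeu_ps(a + i, v);
    }
    /* Tail: the remaining n mod 4 elements, one at a time,
     * using the scalar-single (SS) form of the same instruction. */
    for (; i < n; i++) {
        __m128 s = _mm_sqrt_ss(_mm_set_ss(a[i]));
        a[i] = _mm_cvtss_f32(s);
    }
}
```

The tail loop is needed whenever the array length is not a multiple of four, which the pseudocode leaves implicit.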

Data-Level Parallelism and SIMD
• SIMD wants adjacent values in memory that can be operated on in parallel
• Usually specified in programs as loops
    for(i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
• How can we reveal more data-level parallelism than is available in a single iteration of a loop?
• Unroll the loop and adjust the iteration rate

Looping in MIPS
Assumptions:
- $t1 is initially the address of the element in the array with the highest address
- $f0 contains the scalar value s
- 8($t2) is the address of the last element to operate on
CODE:
Loop: l.d   $f2,0($t1)      # $f2 = array element
      add.d $f10,$f2,$f0    # add s to $f2
      s.d   $f10,0($t1)     # store result
      addiu $t1,$t1,-8      # decrement pointer 8 bytes
      bne   $t1,$t2,Loop    # repeat loop if $t1 != $t2

Loop Unrolled
Loop: l.d   $f2,0($t1)
      add.d $f10,$f2,$f0
      s.d   $f10,0($t1)
      l.d   $f4,-8($t1)
      add.d $f12,$f4,$f0
      s.d   $f12,-8($t1)
      l.d   $f6,-16($t1)
      add.d $f14,$f6,$f0
      s.d   $f14,-16($t1)
      l.d   $f8,-24($t1)
      add.d $f16,$f8,$f0
      s.d   $f16,-24($t1)
      addiu $t1,$t1,-32
      bne   $t1,$t2,Loop
NOTE:
1. Only 1 loop overhead every 4 iterations
2. This unrolling works if loop_limit (mod 4) = 0
3. Using different registers for each iteration eliminates data hazards in pipeline

Loop Unrolled Scheduled
Loop: l.d   $f2,0($t1)
      l.d   $f4,-8($t1)
      l.d   $f6,-16($t1)
      l.d   $f8,-24($t1)
      add.d $f10,$f2,$f0
      add.d $f12,$f4,$f0
      add.d $f14,$f6,$f0
      add.d $f16,$f8,$f0
      s.d   $f10,0($t1)
      s.d   $f12,-8($t1)
      s.d   $f14,-16($t1)
      s.d   $f16,-24($t1)
      addiu $t1,$t1,-32
      bne   $t1,$t2,Loop
• 4 loads side-by-side: could replace with 4-wide SIMD load
• 4 adds side-by-side: could replace with 4-wide SIMD add
• 4 stores side-by-side: could replace with 4-wide SIMD store
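
The replacement of four side-by-side loads, adds, and stores with 4-wide SIMD operations can be sketched with SSE intrinsics. One assumption to flag: the MIPS code works on doubles, but a 128-bit XMM register holds only two doubles, so this sketch uses `float` elements to get four lanes. The function name `add_scalar_simd` is hypothetical.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE packed single-precision intrinsics */

/* x[i] = x[i] + s for all i, four floats per iteration.
 * Hypothetical helper; float (not double) is assumed so that
 * one XMM register holds four elements. */
void add_scalar_simd(float *x, size_t n, float s) {
    __m128 vs = _mm_set1_ps(s);            /* [s | s | s | s] */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);    /* one 4-wide load  */
        v = _mm_add_ps(v, vs);             /* one 4-wide add   */
        _mm_storeu_ps(x + i, v);           /* one 4-wide store */
    }
    for (; i < n; i++) {                   /* leftover elements */
        x[i] += s;
    }
}
```

Each trip through the main loop does the work of the twelve load, add, and store instructions in the unrolled MIPS body with three vector operations.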

Loop Unrolling in C
• Instead of compiler doing loop unrolling, could do it yourself in C
    for(i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
• Could be rewritten
    for(i=1000; i>0; i=i-4) {
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }
• What is the downside of doing it in C?

Generalizing Loop Unrolling
• A loop of n iterations
• k copies of the body of the loop
• Assuming (n mod k) ≠ 0, we will run the loop with 1 copy of the body (n mod k) times and with k copies of the body floor(n/k) times
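
The general rule can be sketched in C for k = 4: a single-copy loop runs (n mod 4) times, then the unrolled loop runs floor(n/4) times. The function name `add_scalar_unrolled` and the 0-based, ascending index order are assumptions for illustration; the slides' loop counts down from 1000.

```c
#include <stddef.h>

/* x[i] = x[i] + s for 0 <= i < n, unrolled by k = 4.
 * Hypothetical helper name. */
void add_scalar_unrolled(double *x, size_t n, double s) {
    size_t i = 0;
    size_t rem = n % 4;

    /* 1 copy of the body, executed (n mod 4) times. */
    for (; i < rem; i++) {
        x[i] = x[i] + s;
    }
    /* 4 copies of the body, executed floor(n/4) times.
     * n - rem is a multiple of 4, so i + 3 never passes the end. */
    for (; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```

Running the leftover iterations first keeps the unrolled loop free of bounds checks inside its body.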

Example: Add Two Single-Precision Floating-Point Vectors
Computation to be performed:
    vec_res.x = v1.x + v2.x;
    vec_res.y = v1.y + v2.y;
    vec_res.z = v1.z + v2.z;
    vec_res.w = v1.w + v2.w;
SSE Instruction Sequence:
(Note: Destination on the right in AT&T-syntax x86 assembly)
    movaps address-of-v1, %xmm0
        // v1.w | v1.z | v1.y | v1.x -> xmm0
    addps address-of-v2, %xmm0
        // v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x -> xmm0
    movaps %xmm0, address-of-vec_res
• movaps: move from mem to XMM register, memory aligned, packed single precision
• addps: add from mem to XMM register, packed single precision
• movaps: move from XMM register to mem, memory aligned, packed single precision
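
The same three-step sequence (load, packed add, store) can be written with intrinsics instead of assembly. This is a sketch under assumptions: the `vec4` struct and the `vec4_add` name are hypothetical, and unaligned loads and stores (MOVUPS) are used instead of the slide's aligned MOVAPS because a plain struct of four floats is not guaranteed to sit on a 16-byte boundary.

```c
#include <xmmintrin.h>  /* SSE packed single-precision intrinsics */

/* Hypothetical 4-component vector; four contiguous floats. */
typedef struct {
    float x, y, z, w;
} vec4;

/* Returns v1 + v2, adding all four components with one ADDPS. */
vec4 vec4_add(const vec4 *v1, const vec4 *v2) {
    vec4 res;
    __m128 a = _mm_loadu_ps((const float *)v1);  /* v1.w | v1.z | v1.y | v1.x */
    __m128 b = _mm_loadu_ps((const float *)v2);  /* v2.w | v2.z | v2.y | v2.x */
    _mm_storeu_ps((float *)&res, _mm_add_ps(a, b));
    return res;
}
```

If the vectors were declared with 16-byte alignment, `_mm_load_ps` and `_mm_store_ps` would correspond to the aligned moves shown in the assembly.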

Administrivia
• MT2 is Monday, April 4th, 7-9pm:
  – Covers lecture material up till and including 3/28 lecture (Amdahl's law)
  – Conflict: Email Fred or William by midnight today
  – Watch Piazza for locations announcement
• Guerrilla Session: Floating Point & Performance
  – Wed 3/30 3-5 PM @ 241 Cory
  – Sat 4/02 1-3 PM @ 521 Cory
• Project 3-2 feedback

Intel SSE Intrinsics
• Intrinsics are C functions and procedures for inserting assembly language into C code, including SSE instructions
  – With intrinsics, can program using these instructions indirectly
  – Close to one-to-one correspondence between SSE instructions and intrinsics (a few, such as _mm_load1_pd, expand to a short instruction sequence)

Example SSE Intrinsics
Intrinsics:             Corresponding SSE instructions:
• Vector data type:
    __m128d
• Load and store operations:
    _mm_load_pd         MOVAPD / aligned, packed double
    _mm_store_pd        MOVAPD / aligned, packed double
    _mm_loadu_pd        MOVUPD / unaligned, packed double
    _mm_storeu_pd       MOVUPD / unaligned, packed double
• Load and broadcast across vector:
    _mm_load1_pd        MOVSD + shuffling/duplicating
• Arithmetic:
    _mm_add_pd          ADDPD / add, packed double
    _mm_mul_pd          MULPD / multiply, packed double
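
A short sketch that exercises each kind of intrinsic in the table: an aligned load, a broadcast load, a multiply, an add, and an aligned store. The function name `scale_add2` is hypothetical; the code assumes an x86 compiler with `<emmintrin.h>` and that `y` and `a` point to 16-byte-aligned storage, as `_mm_load_pd` and `_mm_store_pd` require.

```c
#include <emmintrin.h>  /* SSE2: __m128d and packed-double intrinsics */

/* y[0..1] = y[0..1] + a[0..1] * (*s)
 * Hypothetical helper. y and a must be 16-byte aligned. */
void scale_add2(double *y, const double *a, const double *s) {
    __m128d va = _mm_load_pd(a);    /* [a[0] | a[1]], aligned load     */
    __m128d vs = _mm_load1_pd(s);   /* [*s | *s], load and duplicate   */
    __m128d vy = _mm_load_pd(y);    /* [y[0] | y[1]]                   */
    vy = _mm_add_pd(vy, _mm_mul_pd(va, vs));
    _mm_store_pd(y, vy);            /* aligned store                   */
}
```

The `_mm_add_pd(c, _mm_mul_pd(a, b))` pattern is the same multiply-then-accumulate step used in the 2 x 2 matrix multiply example.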
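
Example: 2 x 2 Matrix Multiply
Definition of Matrix Multiply:
$$C_{i,j} = (A \times B)_{i,j} = \sum_{k=1}^{2} A_{i,k} \times B_{k,j}$$
$$\begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix} \times \begin{bmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{bmatrix} = \begin{bmatrix} C_{1,1} & C_{1,2} \\ C_{2,1} & C_{2,2} \end{bmatrix}$$
$$C_{1,1} = A_{1,1}B_{1,1} + A_{1,2}B_{2,1} \qquad C_{1,2} = A_{1,1}B_{1,2} + A_{1,2}B_{2,2}$$
$$C_{2,1} = A_{2,1}B_{1,1} + A_{2,2}B_{2,1} \qquad C_{2,2} = A_{2,1}B_{1,2} + A_{2,2}B_{2,2}$$
Numeric example:
$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \times \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$$
$$C_{1,1} = 1 \cdot 1 + 0 \cdot 2 = 1 \qquad C_{1,2} = 1 \cdot 3 + 0 \cdot 4 = 3$$
$$C_{2,1} = 0 \cdot 1 + 1 \cdot 2 = 2 \qquad C_{2,2} = 0 \cdot 3 + 1 \cdot 4 = 4$$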
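
The definition can be checked with a scalar reference implementation before any SIMD is introduced. The sketch below assumes column-major storage, with element (i, j) at index `i + j*n` using 0-based indices, matching the column order used in the lecture's SSE code. The function name `matmul_colmajor` is hypothetical.

```c
/* C = C + A * B for n x n matrices in column-major order.
 * Hypothetical scalar reference: element (i, j) of M is M[i + j*n],
 * with 0-based i and j. */
void matmul_colmajor(int n, const double *A, const double *B, double *C) {
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            double sum = C[i + j * n];
            for (int k = 0; k < n; k++) {
                sum += A[i + k * n] * B[k + j * n];
            }
            C[i + j * n] = sum;
        }
    }
}
```

With A the identity and B holding 1, 2, 3, 4 in column order, C starts at zero and ends up equal to B, as in the numeric example above.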
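
Example: 2 x 2 Matrix Multiply
• Using the XMM registers
  – 64-bit / double precision / two doubles per XMM reg
    C1 = [ C1,1 | C2,1 ]
    C2 = [ C1,2 | C2,2 ]    (stored in memory in column order)
    B1 = [ Bi,1 | Bi,1 ]
    B2 = [ Bi,2 | Bi,2 ]
    A  = [ A1,i | A2,i ]
  – The matrix C (rows C1,1 C1,2 and C2,1 C2,2) is held as its two columns, C1 and C2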

Example: 2 x 2 Matrix Multiply
• Initialization
    C1 = [ 0 | 0 ]
    C2 = [ 0 | 0 ]
• i = 1
    A  = [ A1,1 | A2,1 ]
    B1 = [ B1,1 | B1,1 ]
    B2 = [ B1,2 | B1,2 ]
  – _mm_load_pd: loads 2 doubles into an XMM reg; A is stored in memory in column order
  – _mm_load1_pd: loads one double and stores it in both the high and low halves of the XMM register (duplicates the value in both halves of XMM)

Example: 2 x 2 Matrix Multiply
• First iteration intermediate result
• i = 1
    C1 = [ 0 + A1,1 B1,1 | 0 + A2,1 B1,1 ]
    C2 = [ 0 + A1,1 B1,2 | 0 + A2,1 B1,2 ]
    A  = [ A1,1 | A2,1 ]
    B1 = [ B1,1 | B1,1 ]
    B2 = [ B1,2 | B1,2 ]
    c1 = _mm_add_pd(c1,_mm_mul_pd(a,b1));
    c2 = _mm_add_pd(c2,_mm_mul_pd(a,b2));
  – SSE instructions first do parallel multiplies and then parallel adds in XMM registers

Example: 2 x 2 Matrix Multiply
• First iteration intermediate result, with operands loaded for the second iteration
• i = 2
    C1 = [ 0 + A1,1 B1,1 | 0 + A2,1 B1,1 ]
    C2 = [ 0 + A1,1 B1,2 | 0 + A2,1 B1,2 ]
    A  = [ A1,2 | A2,2 ]
    B1 = [ B2,1 | B2,1 ]
    B2 = [ B2,2 | B2,2 ]
  – _mm_load_pd loads the second column of A; _mm_load1_pd loads one double from the second row of B and duplicates it in both halves of the XMM register

Example: 2 x 2 Matrix Multiply
• Second iteration intermediate result
• i = 2
    C1 = [ A1,1 B1,1 + A1,2 B2,1 | A2,1 B1,1 + A2,2 B2,1 ] = [ C1,1 | C2,1 ]
    C2 = [ A1,1 B1,2 + A1,2 B2,2 | A2,1 B1,2 + A2,2 B2,2 ] = [ C1,2 | C2,2 ]
    A  = [ A1,2 | A2,2 ]
    B1 = [ B2,1 | B2,1 ]
    B2 = [ B2,2 | B2,2 ]
    c1 = _mm_add_pd(c1,_mm_mul_pd(a,b1));
    c2 = _mm_add_pd(c2,_mm_mul_pd(a,b2));
  – SSE instructions first do parallel multiplies and then parallel adds in XMM registers

Example: 2 x 2 Matrix Multiply (Part 1 of 2)
    #include <stdio.h>
    // header file for SSE compiler intrinsics
    #include <emmintrin.h>

    // NOTE: vector registers will be represented in comments as v1 = [a | b]
    // where v1 is a variable of type __m128d and a, b are doubles

    int main(void) {
        // allocate A, B, C aligned on 16-byte boundaries
        double A[4] __attribute__((aligned(16)));
        double B[4] __attribute__((aligned(16)));
        double C[4] __attribute__((aligned(16)));
        int lda = 2;
        int i = 0;
        // declare several 128-bit vector variables
        __m128d c1, c2, a, b1, b2;

        // Initialize A, B, C for example
        /* A = (note column order!)
           1 0
           0 1 */
        A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;

        /* B = (note column order!)
           1 3
           2 4 */
        B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;

        /* C = (note column order!)
           0 0
           0 0 */
        C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;

Example: 2 x 2 Matrix Multiply (Part 2 of 2)
        // use aligned loads to set
        // c1 = [c_11 | c_21]
        c1 = _mm_load_pd(C + 0*lda);
        // c2 = [c_12 | c_22]
        c2 = _mm_load_pd(C + 1*lda);

        for (i = 0; i < 2; i++) {
            /* a =
               i = 0: [a_11 | a_21]
               i = 1: [a_12 | a_22] */
            a = _mm_load_pd(A + i*lda);
            /* b1 =
               i = 0: [b_11 | b_11]
               i = 1: [b_21 | b_21] */
            b1 = _mm_load1_pd(B + i + 0*lda);
            /* b2 =
               i = 0: [b_12 | b_12]
               i = 1: [b_22 | b_22] */
            b2 = _mm_load1_pd(B + i + 1*lda);

            /* c1 =
               i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
               i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21] */
            c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
            /* c2 =
               i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
               i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22] */
            c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
        }

        // store c1, c2 back into C for completion
        _mm_store_pd(C + 0*lda, c1);
        _mm_store_pd(C + 1*lda, c2);

        // print C
        printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);
        return 0;
    }

Inner loop from gcc -O -S
    L2: movapd  (%rax,%rsi), %xmm1   // Load aligned A[i,i+1] -> m1
        movddup (%rdx), %xmm0        // Load B[j], duplicate -> m0
        mulpd   %xmm1, %xmm0         // Multiply m0*m1 -> m0
        addpd   %xmm0, %xmm3         // Add m0+m3 -> m3
        movddup 16(%rdx), %xmm0      // Load B[j+1], duplicate -> m0
        mulpd   %xmm0, %xmm1         // Multiply m0*m1 -> m1
        addpd   %xmm1, %xmm2         // Add m1+m2 -> m2
        addq    $16, %rax            // rax+16 -> rax (i+=2)
        addq    $8, %rdx             // rdx+8 -> rdx (j+=1)
        cmpq    $32, %rax            // rax == 32?
        jne     L2                   // jump to L2 if not equal
        movapd  %xmm3, (%rcx)        // store aligned m3 into C[k,k+1]
        movapd  %xmm2, (%rdi)        // store aligned m2 into C[l,l+1]

And in Conclusion, …
• Flynn Taxonomy
• Intel SSE SIMD Instructions
  – Exploit data-level parallelism in loops
  – One instruction fetch that operates on multiple operands simultaneously
  – 128-bit XMM registers
• SSE Instructions in C
  – Embed the SSE machine instructions directly into C programs through use of intrinsics
  – Achieve efficiency beyond that of optimizing compiler