Flynn Taxonomy,Data-level Parallelismcs61c/sp16/lec/26/2016Sp... · 2016. 3. 30. · Flynn* Taxonomy, 1966 • In 2013, SIMD and MIMD most common parallelism in architectures –

CS61C: Great Ideas in Computer Architecture
Flynn Taxonomy, Data-level Parallelism

Instructors: Vladimir Stojanovic & Nicholas Weaver
http://inst.eecs.berkeley.edu/~cs61c/

New-School Machine Structures (It’s a bit more complicated!)

• Parallel Requests: assigned to computer, e.g., search “Katz”
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words
• Hardware descriptions: all gates @ one time
• Programming Languages

[Figure: software/hardware stack, "Harness Parallelism & Achieve High Performance". Levels: Warehouse Scale Computer and Smart Phone; Computer (Core … Core, Memory (Cache), Input/Output); Core (Instruction Unit(s), Functional Unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3); Cache Memory; Logic Gates. Label: Today’s Lecture]

Using Parallelism for Performance

• Two basic ways:
  – Multiprogramming
    • run multiple independent programs in parallel
    • “Easy”
  – Parallel computing
    • run one program faster
    • “Hard”
• We’ll focus on parallel computing for the next few lectures

Single-Instruction/Single-Data Stream (SISD)

• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines.

[Figure: Processing Unit]

Single-Instruction/Multiple-Data Stream (SIMD or “sim-dee”)

• SIMD computer applies a single instruction stream to multiple data streams, for operations that may be naturally parallelized, e.g., Intel SIMD instruction extensions or NVIDIA Graphics Processing Unit (GPU)

Multiple-Instruction/Multiple-Data Streams (MIMD or “mim-dee”)

• Multiple autonomous processors simultaneously executing different instructions on different data.
  – MIMD architectures include multicore and Warehouse-Scale Computers

[Figure: Instruction Pool and Data Pool connected to four processing units (PU)]

Multiple-Instruction/Single-Data Stream (MISD)

• Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream.
  – Rare, mainly of historical interest only

Flynn* Taxonomy, 1966

• In 2013, SIMD and MIMD are the most common parallelism in architectures
  – usually both in same system!
• Most common parallel processing programming style: Single Program Multiple Data (“SPMD”)
  – Single program that runs on all processors of a MIMD
  – Cross-processor execution coordination using synchronization primitives
• SIMD (aka hw-level data parallelism): specialized function units, for handling lock-step calculations involving arrays
  – Scientific computing, signal processing, multimedia (audio/video processing)

*Prof. Michael Flynn, Stanford

SIMD Architectures

• Data parallelism: executing same operation on multiple data streams
• Example to provide context:
  – Multiplying a coefficient vector by a data vector (e.g., in filtering)
    y[i] := c[i] × x[i], 0 ≤ i < n
• Sources of performance improvement:
  – One instruction is fetched & decoded for entire operation
  – Multiplications are known to be independent
  – Pipelining/concurrency in memory access as well
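The coefficient-times-data example can be sketched in C with SSE intrinsics. This is a minimal illustration, not code from the lecture: the function names `mul_scalar` and `mul_simd` are hypothetical, the elements are assumed to be single-precision floats (so four fit in one 128-bit register), and unaligned loads are used so no alignment is assumed.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Scalar reference: one multiply per loop iteration. */
void mul_scalar(float *y, const float *c, const float *x, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = c[i] * x[i];
}

/* SIMD sketch: each packed multiply produces four independent products. */
void mul_simd(float *y, const float *c, const float *x, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vc = _mm_loadu_ps(c + i);           /* c[i..i+3] */
        __m128 vx = _mm_loadu_ps(x + i);           /* x[i..i+3] */
        _mm_storeu_ps(y + i, _mm_mul_ps(vc, vx));  /* y[i..i+3] */
    }
    for (; i < n; i++)                             /* leftovers when n % 4 != 0 */
        y[i] = c[i] * x[i];
}
```

Both functions compute the same result; the SIMD version fetches and decodes one multiply instruction per four elements instead of one per element.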

Intel “Advanced Digital Media Boost”

• To improve performance, Intel’s SIMD instructions
  – Fetch one instruction, do the work of multiple instructions

First SIMD Extensions: MIT Lincoln Labs TX-2, 1957

Intel SIMD Extensions

• MMX: 64-bit registers, reusing floating-point registers [1997]
• SSE (later SSE2/3/4): new 128-bit registers [1999]
• AVX: new 256-bit registers [2011]
  – Space for expansion to 1024-bit registers
• AVX-512 [2013]

XMM Registers

• Architecture extended with eight 128-bit data registers: XMM registers
  – x86 64-bit address architecture adds 8 additional registers (XMM8–XMM15)

Intel Architecture SSE2+ 128-Bit SIMD Data Types

[Figure: one 128-bit register shown divided four ways]
• 16 / 128 bits: sixteen 8-bit elements
• 8 / 128 bits: eight 16-bit elements
• 4 / 128 bits: four 32-bit elements
• 2 / 128 bits: two 64-bit elements

• Note: in Intel Architecture (unlike MIPS) a word is 16 bits
  – Single-precision FP: Doubleword (32 bits)
  – Double-precision FP: Quadword (64 bits)

SSE/SSE2 Floating Point Instructions

xmm: one operand is a 128-bit SSE2 register
mem/xmm: other operand is in memory or an SSE2 register
{SS} Scalar Single precision FP: one 32-bit operand in a 128-bit register
{PS} Packed Single precision FP: four 32-bit operands in a 128-bit register
{SD} Scalar Double precision FP: one 64-bit operand in a 128-bit register
{PD} Packed Double precision FP: two 64-bit operands in a 128-bit register
{A} the 128-bit operand is aligned in memory
{U} the 128-bit operand is unaligned in memory
{H} move the high half of the 128-bit operand
{L} move the low half of the 128-bit operand

Note: Move does both load and store

Packed and Scalar Double-Precision Floating-Point Operations

[Figure: Packed operation versus Scalar operation on 128-bit registers]
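The packed/scalar distinction can be seen with two SSE2 intrinsics. A small sketch, with hypothetical helper names `add_packed` and `add_scalar_lane`: `_mm_add_pd` (ADDPD) adds both 64-bit lanes, while `_mm_add_sd` (ADDSD) adds only the low lane and copies the high lane from its first operand.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_add_pd, _mm_add_sd */

/* Packed add: out = { a[0]+b[0], a[1]+b[1] } */
void add_packed(double out[2], const double a[2], const double b[2]) {
    __m128d va = _mm_loadu_pd(a);   /* a[0] goes to the low lane */
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}

/* Scalar add: out = { a[0]+b[0], a[1] }; the high lane passes through from a */
void add_scalar_lane(double out[2], const double a[2], const double b[2]) {
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_sd(va, vb));
}
```

With a = {1, 2} and b = {10, 20}, the packed form yields {11, 22} and the scalar form yields {11, 2}.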

Example: SIMD Array Processing

for each f in array
    f = sqrt(f)

for each f in array
{
    load f to the floating-point register
    calculate the square root
    write the result from the register to memory
}

for each 4 members in array
{
    load 4 members to the SSE register
    calculate 4 square roots in one operation
    store the 4 results from the register to memory
}
SIMD style
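The SIMD-style pseudocode above might look as follows with SSE intrinsics. This is a sketch under assumptions: the function name `sqrt_array_simd` is hypothetical, the array holds single-precision floats, and leftover elements (when the length is not a multiple of 4) are handled one at a time with the scalar {SS} form.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_sqrt_ps, _mm_sqrt_ss */

/* Replaces every a[i] with its square root, four elements per iteration. */
void sqrt_array_simd(float *a, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(a + i);   /* load 4 members to an SSE register */
        v = _mm_sqrt_ps(v);               /* 4 square roots in one operation   */
        _mm_storeu_ps(a + i, v);          /* store the 4 results to memory     */
    }
    for (; i < n; i++) {                  /* scalar cleanup for the remainder  */
        __m128 v = _mm_load_ss(a + i);
        _mm_store_ss(a + i, _mm_sqrt_ss(v));
    }
}
```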

Data-Level Parallelism and SIMD

• SIMD wants adjacent values in memory that can be operated on in parallel
• Usually specified in programs as loops
    for(i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
• How can we reveal more data-level parallelism than is available in a single iteration of a loop?
• Unroll loop and adjust iteration rate

Looping in MIPS

Assumptions:
- $t1 is initially the address of the element in the array with the highest address
- $f0 contains the scalar value s
- 8($t2) is the address of the last element to operate on

CODE:
Loop: l.d   $f2,0($t1)     ; $f2=array element
      add.d $f10,$f2,$f0   ; add s to $f2
      s.d   $f10,0($t1)    ; store result
      addiu $t1,$t1,#-8    ; decrement pointer 8 byte
      bne   $t1,$t2,Loop   ; repeat loop if $t1 != $t2

Loop Unrolled

Loop: l.d   $f2,0($t1)
      add.d $f10,$f2,$f0
      s.d   $f10,0($t1)
      l.d   $f4,-8($t1)
      add.d $f12,$f4,$f0
      s.d   $f12,-8($t1)
      l.d   $f6,-16($t1)
      add.d $f14,$f6,$f0
      s.d   $f14,-16($t1)
      l.d   $f8,-24($t1)
      add.d $f16,$f8,$f0
      s.d   $f16,-24($t1)
      addiu $t1,$t1,#-32
      bne   $t1,$t2,Loop

NOTE:
1. Only 1 Loop Overhead every 4 iterations
2. This unrolling works if loop_limit (mod 4) = 0
3. Using different registers for each iteration eliminates data hazards in pipeline

Loop Unrolled Scheduled

Loop: l.d   $f2,0($t1)
      l.d   $f4,-8($t1)
      l.d   $f6,-16($t1)
      l.d   $f8,-24($t1)
      add.d $f10,$f2,$f0
      add.d $f12,$f4,$f0
      add.d $f14,$f6,$f0
      add.d $f16,$f8,$f0
      s.d   $f10,0($t1)
      s.d   $f12,-8($t1)
      s.d   $f14,-16($t1)
      s.d   $f16,-24($t1)
      addiu $t1,$t1,#-32
      bne   $t1,$t2,Loop

4 Loads side-by-side: Could replace with 4-wide SIMD Load
4 Adds side-by-side: Could replace with 4-wide SIMD Add
4 Stores side-by-side: Could replace with 4-wide SIMD Store
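As a rough C counterpart to the scheduled loop, the four loads, four adds, and four stores can each collapse into one SSE intrinsic. This sketch makes assumptions not in the slides: the function name `add_scalar_simd` is hypothetical, the elements are single-precision floats (the MIPS code uses doubles, for which a 128-bit register would only be 2-wide), and the length must be a multiple of 4, mirroring the loop_limit (mod 4) = 0 condition.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* x[i] = x[i] + s for i in [0, n); n must be a multiple of 4. */
void add_scalar_simd(float *x, size_t n, float s) {
    assert(n % 4 == 0);
    __m128 vs = _mm_set1_ps(s);            /* s copied into all four lanes */
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);   /* stands in for the 4 loads  */
        vx = _mm_add_ps(vx, vs);           /* stands in for the 4 adds   */
        _mm_storeu_ps(x + i, vx);          /* stands in for the 4 stores */
    }
}
```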

Loop Unrolling in C

• Instead of compiler doing loop unrolling, could do it yourself in C
    for(i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
• Could be rewritten
    for(i=1000; i>0; i=i-4) {
        x[i] = x[i] + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }

What is the downside of doing it in C?

Generalizing Loop Unrolling

• A loop of n iterations
• k copies of the body of the loop
• Assuming (n mod k) ≠ 0
  Then we will run the loop with 1 copy of the body (n mod k) times and with k copies of the body floor(n/k) times
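The general rule can be sketched in plain C for k = 4. The function name `add_scalar_unrolled` is hypothetical, and the sketch indexes upward from 0 rather than downward from 1000 as the slides do; the structure is the same: a 1-copy loop that runs (n mod k) times, then a k-copy loop that runs floor(n/k) times.

```c
#include <stddef.h>

/* x[i] = x[i] + s for i in [0, n), unrolled by k = 4 for any n. */
void add_scalar_unrolled(double *x, size_t n, double s) {
    size_t i = 0;
    size_t r = n % 4;

    /* 1 copy of the body, run (n mod 4) times */
    for (; i < r; i++)
        x[i] = x[i] + s;

    /* 4 copies of the body, run floor(n/4) times */
    for (; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```

For n = 10 the first loop handles 2 elements and the unrolled loop runs twice, covering the remaining 8.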

Example: Add Two Single-Precision Floating-Point Vectors

Computation to be performed:

    vec_res.x = v1.x + v2.x;
    vec_res.y = v1.y + v2.y;
    vec_res.z = v1.z + v2.z;
    vec_res.w = v1.w + v2.w;

SSE Instruction Sequence:
(Note: Destination on the right in x86 assembly, AT&T syntax)

    movaps address-of-v1, %xmm0
        // v1.w | v1.z | v1.y | v1.x -> xmm0
    addps address-of-v2, %xmm0
        // v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x -> xmm0
    movaps %xmm0, address-of-vec_res

movaps: move from mem to XMM register, memory aligned, packed single precision
addps: add from mem to XMM register, packed single precision
movaps: move from XMM register to mem, memory aligned, packed single precision
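The same three-step sequence can be written with intrinsics instead of assembly. A minimal sketch: the `vec4` type and the `vec4_add` function are hypothetical names, and the 16-byte alignment that the aligned moves require is requested with the GCC/Clang `aligned` attribute.

```c
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical 4-float vector type, aligned so aligned loads/stores are legal. */
typedef struct {
    float x, y, z, w;
} __attribute__((aligned(16))) vec4;

/* Returns v1 + v2, all four components in one packed add. */
vec4 vec4_add(const vec4 *v1, const vec4 *v2) {
    vec4 res;
    __m128 a = _mm_load_ps(&v1->x);           /* aligned load of v1 */
    __m128 b = _mm_load_ps(&v2->x);           /* aligned load of v2 */
    _mm_store_ps(&res.x, _mm_add_ps(a, b));   /* packed add, aligned store */
    return res;
}
```

A compiler is free to fold the second load into the add, as the assembly sequence does with its memory operand.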

Administrivia

• MT2 is Monday, April 4th, 7-9pm:
  – Covers lecture material up till and including 3/28 lecture (Amdahl’s law)
  – Conflict: Email Fred or William by midnight today
  – Watch Piazza for locations announcement
• Guerrilla Session: Floating Point & Performance
  – Wed 3/30 3-5 PM @ 241 Cory
  – Sat 4/02 1-3 PM @ 521 Cory
• Project 3-2 feedback

Intel SSE Intrinsics

• Intrinsics are C functions and procedures for inserting assembly language into C code, including SSE instructions
  – With intrinsics, can program using these instructions indirectly
  – Mostly one-to-one correspondence between SSE instructions and intrinsics

Example SSE Intrinsics

Intrinsics:            Corresponding SSE instructions:

• Vector data type:
    __m128d
• Load and store operations:
    _mm_load_pd        MOVAPD / aligned, packed double
    _mm_store_pd       MOVAPD / aligned, packed double
    _mm_loadu_pd       MOVUPD / unaligned, packed double
    _mm_storeu_pd      MOVUPD / unaligned, packed double
• Load and broadcast across vector:
    _mm_load1_pd       MOVSD + shuffling/duplicating
• Arithmetic:
    _mm_add_pd         ADDPD / add, packed double
    _mm_mul_pd         MULPD / multiply, packed double
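A short sketch that exercises the intrinsics listed above together. The function name `axpy2` and its parameters are hypothetical; the unaligned load/store variants are used so the caller's arrays need no particular alignment.

```c
#include <emmintrin.h>  /* SSE2 intrinsics for packed doubles */

/* out[0..1] = out[0..1] + (*scale) * in[0..1] */
void axpy2(double out[2], const double in[2], const double *scale) {
    __m128d v   = _mm_loadu_pd(in);      /* [in[0] | in[1]], unaligned load   */
    __m128d s   = _mm_load1_pd(scale);   /* [*scale | *scale], broadcast load */
    __m128d acc = _mm_loadu_pd(out);     /* [out[0] | out[1]]                 */
    acc = _mm_add_pd(acc, _mm_mul_pd(v, s));   /* packed multiply, then packed add */
    _mm_storeu_pd(out, acc);             /* unaligned store                   */
}
```

With out = {1, 2}, in = {3, 4}, and a scale of 2, the result is {7, 10}.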

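Example: 2x2 Matrix Multiply

Definition of Matrix Multiply:

$$C_{i,j} = (A \times B)_{i,j} = \sum_{k=1}^{2} A_{i,k} \times B_{k,j}$$

$$\begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix} \times \begin{bmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{bmatrix} = \begin{bmatrix} C_{1,1} = A_{1,1}B_{1,1} + A_{1,2}B_{2,1} & C_{1,2} = A_{1,1}B_{1,2} + A_{1,2}B_{2,2} \\ C_{2,1} = A_{2,1}B_{1,1} + A_{2,2}B_{2,1} & C_{2,2} = A_{2,1}B_{1,2} + A_{2,2}B_{2,2} \end{bmatrix}$$

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \times \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} = \begin{bmatrix} C_{1,1} = 1 \cdot 1 + 0 \cdot 2 = 1 & C_{1,2} = 1 \cdot 3 + 0 \cdot 4 = 3 \\ C_{2,1} = 0 \cdot 1 + 1 \cdot 2 = 2 & C_{2,2} = 0 \cdot 3 + 1 \cdot 4 = 4 \end{bmatrix}$$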
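For reference, the definition can be written as a scalar C triple loop. This is a sketch, not the lecture's code: the function name `matmul2x2` is hypothetical, and the matrices are assumed to be stored in column order (element (i, j), counting from 0, lives at index i + 2*j), which matches the layout the later slides use.

```c
/* C = A x B for 2x2 matrices in column order: M[i + 2*j] holds row i, column j. */
void matmul2x2(double C[4], const double A[4], const double B[4]) {
    for (int j = 0; j < 2; j++) {
        for (int i = 0; i < 2; i++) {
            double sum = 0.0;
            for (int k = 0; k < 2; k++)
                sum += A[i + 2 * k] * B[k + 2 * j];   /* A(i,k) * B(k,j) */
            C[i + 2 * j] = sum;
        }
    }
}
```

With A the identity and B stored as {1, 2, 3, 4}, C comes out as {1, 2, 3, 4}, the same values as the worked example above.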

Example: 2x2 Matrix Multiply

• Using the XMM registers
  – 64-bit / double precision / two doubles per XMM reg

  C1 = [ C1,1 | C2,1 ]
  C2 = [ C1,2 | C2,2 ]    Stored in memory in Column order
  A  = [ A1,i | A2,i ]
  B1 = [ Bi,1 | Bi,1 ]
  B2 = [ Bi,2 | Bi,2 ]

[Figure: matrix C with its two columns labeled C1 and C2]


Example: 2x2 Matrix Multiply

• Initialization
  C1 = [ 0 | 0 ]
  C2 = [ 0 | 0 ]
• i = 1
  A  = [ A1,1 | A2,1 ]    _mm_load_pd: Load 2 doubles into XMM reg, stored in memory in Column order
  B1 = [ B1,1 | B1,1 ]
  B2 = [ B1,2 | B1,2 ]    _mm_load1_pd: SSE intrinsic that loads one double (64 bits) and stores it in both the high and low halves of the XMM register (duplicates value in both halves of XMM)


• First iteration intermediate result
• i = 1
  C1 = [ 0+A1,1B1,1 | 0+A2,1B1,1 ]
  C2 = [ 0+A1,1B1,2 | 0+A2,1B1,2 ]
  B1 = [ B1,1 | B1,1 ]
  B2 = [ B1,2 | B1,2 ]

  c1 = _mm_add_pd(c1,_mm_mul_pd(a,b1));
  c2 = _mm_add_pd(c2,_mm_mul_pd(a,b2));

  SSE instructions first do parallel multiplies and then parallel adds in XMM registers


• First iteration intermediate result
• i = 2
  C1 = [ 0+A1,1B1,1 | 0+A2,1B1,1 ]
  C2 = [ 0+A1,1B1,2 | 0+A2,1B1,2 ]
  A  = [ A1,2 | A2,2 ]    _mm_load_pd: Stored in memory in Column order
  B1 = [ B2,1 | B2,1 ]
  B2 = [ B2,2 | B2,2 ]


• Second iteration intermediate result
• i = 2
  C1 = [ A1,1B1,1+A1,2B2,1 | A2,1B1,1+A2,2B2,1 ] = [ C1,1 | C2,1 ]
  C2 = [ A1,1B1,2+A1,2B2,2 | A2,1B1,2+A2,2B2,2 ] = [ C1,2 | C2,2 ]
  B1 = [ B2,1 | B2,1 ]
  B2 = [ B2,2 | B2,2 ]

Example: 2x2 Matrix Multiply (Part 1 of 2)

#include <stdio.h>
// header file for SSE compiler intrinsics
#include <emmintrin.h>

// NOTE: vector registers will be represented in comments as v1 = [a | b]
// where v1 is a variable of type __m128d and a, b are doubles

int main(void) {
    // allocate A, B, C aligned on 16-byte boundaries
    double A[4] __attribute__ ((aligned (16)));
    double B[4] __attribute__ ((aligned (16)));
    double C[4] __attribute__ ((aligned (16)));
    int lda = 2;
    int i = 0;
    // declare several 128-bit vector variables
    __m128d c1, c2, a, b1, b2;

    // Initialize A, B, C for example
    /* A = (note column order!)
       1 0
       0 1
     */
    A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;

    /* B = (note column order!)
       1 3
       2 4
     */
    B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;

    /* C = (note column order!)
       0 0
       0 0
     */
    C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;

Example: 2x2 Matrix Multiply (Part 2 of 2)

    // use aligned loads to set
    // c1 = [c_11 | c_21]
    c1 = _mm_load_pd(C+0*lda);
    // c2 = [c_12 | c_22]
    c2 = _mm_load_pd(C+1*lda);

    for (i = 0; i < 2; i++) {
        /* a =
           i = 0: [a_11 | a_21]
           i = 1: [a_12 | a_22]
         */
        a = _mm_load_pd(A+i*lda);
        /* b1 =
           i = 0: [b_11 | b_11]
           i = 1: [b_21 | b_21]
         */
        b1 = _mm_load1_pd(B+i+0*lda);
        /* b2 =
           i = 0: [b_12 | b_12]
           i = 1: [b_22 | b_22]
         */
        b2 = _mm_load1_pd(B+i+1*lda);

        /* c1 =
           i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
           i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21]
         */
        c1 = _mm_add_pd(c1,_mm_mul_pd(a,b1));
        /* c2 =
           i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
           i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22]
         */
        c2 = _mm_add_pd(c2,_mm_mul_pd(a,b2));
    }

    // store c1, c2 back into C for completion
    _mm_store_pd(C+0*lda,c1);
    _mm_store_pd(C+1*lda,c2);

    // print C
    printf("%g,%g\n%g,%g\n",C[0],C[2],C[1],C[3]);
    return 0;
}

Inner loop from gcc -O -S

L2: movapd  (%rax,%rsi), %xmm1   // Load aligned A[i,i+1]->m1
    movddup (%rdx), %xmm0        // Load B[j], duplicate->m0
    mulpd   %xmm1, %xmm0         // Multiply m0*m1->m0
    addpd   %xmm0, %xmm3         // Add m0+m3->m3
    movddup 16(%rdx), %xmm0      // Load B[j+1], duplicate->m0
    mulpd   %xmm0, %xmm1         // Multiply m0*m1->m1
    addpd   %xmm1, %xmm2         // Add m1+m2->m2
    addq    $16, %rax            // rax+16 -> rax (i+=2)
    addq    $8, %rdx             // rdx+8 -> rdx (j+=1)
    cmpq    $32, %rax            // rax == 32?
    jne     L2                   // jump to L2 if not equal
    movapd  %xmm3, (%rcx)        // store aligned m3 into C[k,k+1]
    movapd  %xmm2, (%rdi)        // store aligned m2 into C[l,l+1]

And in Conclusion, …

• Flynn Taxonomy
• Intel SSE SIMD Instructions
  – Exploit data-level parallelism in loops
  – One instruction fetch that operates on multiple operands simultaneously
  – 128-bit XMM registers
• SSE Instructions in C
  – Embed the SSE machine instructions directly into C programs through use of intrinsics
  – Achieve efficiency beyond that of optimizing compiler
