Page 1
CS 152 Computer Architecture and Engineering
CS 252 Graduate Computer Architecture
Lecture 13 – VLIW
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152
Page 2
Last Time in Lecture 12
§ Branch prediction
– temporal, history of a single branch
– spatial, based on path through multiple branches
§ Branch History Table (BHT) vs. Branch Target Buffer (BTB)
– tradeoff in capacity versus latency
§ Return-Address Stack (RAS)
– specialized structure to predict subroutine return addresses
§ Fetching more than one basic block per cycle
– predicting multiple branches
– trace cache
Page 3
Superscalar Control Logic Scaling
[Figure: an issue group of issue width W is checked against previously issued instructions that are still in flight over lifetime L]
§ Each issued instruction must somehow check against W*L instructions, i.e., growth in hardware ∝ W*(W*L)
§ For in-order machines, L is related to pipeline latencies and check is done during issue (interlocks or scoreboard)
§ For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB), and check is done by broadcasting tags to waiting instructions at writeback (completion)
§ As W increases, larger instruction window is needed to find enough parallelism to keep machine busy => greater L
=> Out-of-order control logic grows faster than W^2 (~W^3)
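The scaling argument above can be made concrete with a small count of dependency checks. This is only a sketch: the function names are hypothetical, and the linear relation L = k*W in the second function is an illustrative assumption standing in for "larger window => greater L".

```python
def dependency_checks(issue_width: int, lifetime: int) -> int:
    """Checks per cycle: each of W issuing instructions checks W*L others."""
    in_flight = issue_width * lifetime  # W*L previously issued instructions
    return issue_width * in_flight      # W*(W*L)


def checks_with_growing_window(issue_width: int, k: int = 1) -> int:
    """Assumption for illustration: lifetime grows linearly with width, L = k*W."""
    return dependency_checks(issue_width, k * issue_width)


if __name__ == "__main__":
    for w in (1, 2, 4, 8):
        print(w, dependency_checks(w, lifetime=4), checks_with_growing_window(w))
```

With L held fixed the count grows as W^2; once L is allowed to grow with W it grows as W^3, which is the trend the slide points at.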
Page 4
Out-of-Order Control Complexity: MIPS R10000
[Figure: MIPS R10000 with the control logic labeled]
[SGI/MIPS Technologies Inc., 1995]
Page 5
Sequential ISA Bottleneck
Sequential source code:
    a = foo(b);
    for (i=0, i<
=> Superscalar compiler
– Find independent operations
– Schedule operations
=> Sequential machine code
=> Superscalar processor
– Check instruction dependencies
– Schedule execution
Page 6
VLIW: Very Long Instruction Word
Instruction format: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
– Two Integer Units, Single-Cycle Latency
– Two Load/Store Units, Three-Cycle Latency
– Two Floating-Point Units, Four-Cycle Latency
§ Multiple operations packed into one instruction
§ Each operation slot is for a fixed function
§ Constant operation latencies are specified
§ Architecture requires guarantee of:
– Parallelism within an instruction => no cross-operation RAW check
– No data use before data ready => no data interlocks
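Because the hardware has no interlocks, the "no data use before data ready" guarantee has to be established by software. A minimal sketch of such a check follows, using the slot layout and latencies from the slide. The `schedule` and `deps` input formats and the function name are hypothetical, chosen only for this illustration.

```python
LATENCY = {"int": 1, "mem": 3, "fp": 4}  # cycles, as given on the slide
SLOT_KIND = {"int1": "int", "int2": "int", "mem1": "mem",
             "mem2": "mem", "fp1": "fp", "fp2": "fp"}


def check_schedule(schedule, deps):
    """Return (consumer, producer) pairs that use data before it is ready.

    schedule: list of instruction words, each a dict {slot: op_name};
              an empty dict is an instruction made entirely of NOPs.
    deps:     {consumer_op: [producer_op, ...]} (hypothetical input format).
    """
    issue, kind = {}, {}
    for cycle, word in enumerate(schedule):
        for slot, op in word.items():
            if slot not in SLOT_KIND:
                raise ValueError(f"unknown slot {slot!r}")
            issue[op], kind[op] = cycle, SLOT_KIND[slot]
    violations = []
    for consumer, producers in deps.items():
        for producer in producers:
            ready = issue[producer] + LATENCY[kind[producer]]
            if issue[consumer] < ready:
                violations.append((consumer, producer))
    return violations


deps = {"fadd": ["fld"], "fsd": ["fadd"]}
good = [{"mem1": "fld"}, {}, {}, {"fp1": "fadd"}, {}, {}, {}, {"mem1": "fsd"}]
bad = [{"mem1": "fld"}, {"fp1": "fadd"}, {}, {}, {}, {"mem1": "fsd"}]
```

Here `check_schedule(good, deps)` reports no violations, while `bad` issues the fadd before the three-cycle load has completed and is flagged.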
Page 7
Early VLIW Machines
§ FPS AP120B (1976)
– scientific attached array processor
– first commercial wide instruction machine
– hand-coded vector math libraries using software pipelining and loop unrolling
§ Multiflow Trace (1987)
– commercialization of ideas from Fisher’s Yale group including “trace scheduling”
– available in configurations with 7, 14, or 28 operations/instruction
– 28 operations packed into a 1024-bit instruction word
§ Cydrome Cydra-5 (1987)
– 7 operations encoded in 256-bit instruction word
– rotating register file
Page 8
VLIW Compiler Responsibilities
§ Schedule operations to maximize parallel execution
§ Guarantees intra-instruction parallelism
§ Schedule to avoid data hazards (no interlocks)
– Typically separates operations with explicit NOPs
Page 9
Loop Execution
for (i=0; i<N; i++)
    B[i] = A[i] + C;

Compile:
loop: fld f1, 0(x1)
      add x1, 8
      fadd f2, f0, f1
      fsd f2, 0(x2)
      add x2, 8
      bne x1, x3, loop

Schedule:
Cycle  Int1    Int2  M1   M2  FP+   FPx
1      add x1        fld                  (loop:)
2
3
4                             fadd
5
6
7
8      add x2  bne   fsd

How many FP ops/cycle?
1 fadd / 8 cycles = 0.125
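The 8-cycle figure can be reproduced by issuing each operation as early as its inputs allow. The sketch below assumes the slide's latencies, assumes registers are read at issue (so an operation may share a cycle with a later write to one of its sources), and ignores unit counts since they are not limiting here. The tuple format and function name are hypothetical.

```python
LATENCY = {"int": 1, "mem": 3, "fp": 4}  # cycles, from the VLIW slide

# (name, unit, true dependences, "not before" constraints)
LOOP_BODY = [
    ("fld f1",  "mem", [],          []),
    ("add x1",  "int", [],          ["fld f1"]),            # fld reads old x1
    ("fadd f2", "fp",  ["fld f1"],  []),
    ("fsd f2",  "mem", ["fadd f2"], []),
    ("add x2",  "int", [],          ["fsd f2"]),            # fsd reads old x2
    ("bne",     "int", ["add x1"],  ["fsd f2", "add x2"]),  # branch goes last
]


def schedule(ops):
    """Earliest issue cycle (1-based) for each op, visited in program order."""
    issue, unit_of = {}, {}
    for name, unit, raw, not_before in ops:
        cycle = 1
        for producer in raw:         # wait for the producer's full latency
            cycle = max(cycle, issue[producer] + LATENCY[unit_of[producer]])
        for earlier in not_before:   # same cycle is allowed, earlier is not
            cycle = max(cycle, issue[earlier])
        issue[name], unit_of[name] = cycle, unit
    return issue


issue = schedule(LOOP_BODY)
loop_cycles = max(issue.values())
fp_ops_per_cycle = 1 / loop_cycles
```

The resulting issue cycles are fld at 1, fadd at 4, and fsd at 8, giving 1 fadd per 8 cycles.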
Page 10
Loop Unrolling
for (i=0; i<N; i++)
    B[i] = A[i] + C;

Unroll inner loop to perform 4 iterations at once:

for (i=0; i<N; i+=4)
{
    B[i] = A[i] + C;
    B[i+1] = A[i+1] + C;
    B[i+2] = A[i+2] + C;
    B[i+3] = A[i+3] + C;
}

Need to handle values of N that are not multiples of unrolling factor with final cleanup loop
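The cleanup loop mentioned above can be sketched as follows. Python is used only to show the control structure (the function name is hypothetical); a compiler would apply the same transformation to the machine code.

```python
def add_constant_unrolled(a, c):
    """Compute b[i] = a[i] + c with the main loop unrolled 4 ways."""
    n = len(a)
    b = [0.0] * n
    main_end = n - (n % 4)           # largest multiple of 4 that is <= n
    for i in range(0, main_end, 4):  # unrolled body: 4 iterations per pass
        b[i] = a[i] + c
        b[i + 1] = a[i + 1] + c
        b[i + 2] = a[i + 2] + c
        b[i + 3] = a[i + 3] + c
    for i in range(main_end, n):     # cleanup loop: 0 to 3 leftover elements
        b[i] = a[i] + c
    return b
```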
Page 11
Scheduling Loop Unrolled Code
Unroll 4 ways:
loop: fld f1, 0(x1)
      fld f2, 8(x1)
      fld f3, 16(x1)
      fld f4, 24(x1)
      add x1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      fsd f5, 0(x2)
      fsd f6, 8(x2)
      fsd f7, 16(x2)
      fsd f8, 24(x2)
      add x2, 32
      bne x1, x3, loop

Schedule:
Cycle  Int1    Int2  M1      M2  FP+      FPx
1                    fld f1                     (loop:)
2                    fld f2
3                    fld f3
4      add x1        fld f4      fadd f5
5                                fadd f6
6                                fadd f7
7                                fadd f8
8                    fsd f5
9                    fsd f6
10                   fsd f7
11     add x2  bne   fsd f8

How many FLOPS/cycle?
4 fadds / 11 cycles = 0.36
Page 12
Software Pipelining
Unroll 4 ways first:
loop: fld f1, 0(x1)
      fld f2, 8(x1)
      fld f3, 16(x1)
      fld f4, 24(x1)
      add x1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      fsd f5, 0(x2)
      fsd f6, 8(x2)
      fsd f7, 16(x2)
      add x2, 32
      fsd f8, -8(x2)
      bne x1, x3, loop

[Schedule figure, columns Int1, Int2, M1, M2, FP+, FPx: the fld, fadd, and fsd groups of successive unrolled iterations overlap in time. A prolog starts the first loads and adds, the steady-state loop (iterate) issues fld f1-f4, fadd f5-f8, fsd f5-f8, add x1, add x2, and bne together, and an epilog finishes the remaining adds and stores.]

How many FLOPS/cycle?
4 fadds / 4 cycles = 1
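The prolog / loop / epilog structure can be illustrated at the source level. The sketch below is a simplified three-stage pipeline (load, add, store) with one element per stage rather than the slide's 4-way unrolled schedule. The function name is hypothetical, and the sequential statement order inside the kernel stands in for operations that a VLIW would issue in the same instruction.

```python
def software_pipelined_add(a, c):
    """b[i] = a[i] + c, with load, add, and store of different iterations overlapped."""
    n = len(a)
    if n < 2:                       # too short to pipeline
        return [x + c for x in a]
    b = [None] * n
    # Prolog: fill the pipeline.
    loaded = a[0]                   # load for iteration 0
    summed = loaded + c             # add for iteration 0
    loaded = a[1]                   # load for iteration 1
    # Kernel: each pass stores iteration i, adds i+1, loads i+2.
    for i in range(n - 2):
        b[i] = summed
        summed = loaded + c
        loaded = a[i + 2]
    # Epilog: drain the pipeline.
    b[n - 2] = summed
    summed = loaded + c
    b[n - 1] = summed
    return b
```

Every pass through the kernel does useful work for three different iterations, which is why the steady state reaches one result per pass once the prolog has run.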
Page 13
Software Pipelining vs. Loop Unrolling
[Figure: two plots of performance versus time. Loop Unrolled: startup overhead and wind-down overhead appear in every loop iteration. Software Pipelined: startup overhead appears once at the start and wind-down overhead once at the end.]
Software pipelining pays startup/wind-down costs only once per loop, not once per iteration
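A rough cycle-count model makes the difference visible. The 11-cycle unrolled body and the 4-cycle steady-state kernel come from the earlier schedules. The 7-cycle fill/drain cost is an assumption made for this sketch (the difference between the two), not a number from the slides, and N is assumed to be a multiple of 4.

```python
UNROLLED_BODY_CYCLES = 11   # loop-unrolled schedule: 4 iterations per 11 cycles
KERNEL_CYCLES = 4           # software-pipelined steady state: 4 iterations per 4 cycles
FILL_DRAIN_CYCLES = UNROLLED_BODY_CYCLES - KERNEL_CYCLES  # assumed one-time cost


def cycles_unrolled(n: int) -> int:
    """Startup and wind-down are paid in every pass through the unrolled body."""
    return (n // 4) * UNROLLED_BODY_CYCLES


def cycles_software_pipelined(n: int) -> int:
    """Startup and wind-down are paid once for the whole loop."""
    groups = n // 4
    return groups * KERNEL_CYCLES + FILL_DRAIN_CYCLES if groups else 0
```

Under these assumptions, 400 elements take 1100 cycles unrolled and 407 cycles software pipelined, while a single group of 4 takes 11 cycles either way.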
Page 14
CS152 Administrivia
§ Lab 2 extension, due Friday March 15
§ PS 3 due Monday March 18
§ Midterm grades will be released today
§ Regrade requests will be through Gradescope
– Window opens Friday, 3/15/19 at 4pm (after section)
– Window closes Friday, 3/22/19 at 12pm (before section)
Page 15
CS152 Administrivia
[Figure: histogram of Midterm 1 grades; tick labels run from 20.00 to 70.00 and from 0 to 12]
Midterm 1 Grades: mean = 41.4, σ = 11.3
Page 16
CS252
CS252 Administrivia
§ Readings next week on OoO superscalar microprocessors
Page 17
What if there are no loops?
§ Branches limit basic block size in control-flow intensive irregular code
§ Difficult to find ILP in individual basic blocks
[Figure: control-flow graph with one basic block labeled]
Page 18
Trace Scheduling [Fisher, Ellis]
§ Pick string of basic blocks, a trace, that represents most frequent branch path
§ Use profiling feedback or compiler heuristics to find common branch paths
§ Schedule whole “trace” at once
§ Add fixup code to cope with branches jumping out of trace
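Trace selection from profile data can be sketched as a greedy walk that always follows the most frequently taken successor. This is a simplified illustration, not the full Fisher/Ellis algorithm: the control-flow graph format and function names are hypothetical, and only side exits are reported (a real compiler must also handle branches that enter the trace in the middle).

```python
def pick_trace(cfg, entry):
    """Follow the hottest successor edge from entry until the path ends or loops.

    cfg: {block: {successor: profiled execution count}} (hypothetical format).
    """
    trace, block = [entry], entry
    while cfg.get(block):
        successor = max(cfg[block], key=cfg[block].get)
        if successor in trace:      # stop rather than re-enter the trace
            break
        trace.append(successor)
        block = successor
    return trace


def side_exits(cfg, trace):
    """Edges that jump out of the trace; each one needs fixup code."""
    exits = []
    for i, block in enumerate(trace):
        next_on_trace = trace[i + 1] if i + 1 < len(trace) else None
        for successor in cfg.get(block, {}):
            if successor != next_on_trace:
                exits.append((block, successor))
    return exits


cfg = {"b0": {"b1": 90, "b2": 10}, "b1": {"b3": 90}, "b2": {"b3": 10}, "b3": {}}
trace = pick_trace(cfg, "b0")
exits = side_exits(cfg, trace)
```

For this example profile the trace is b0, b1, b3, and the only side exit is the rarely taken edge from b0 to b2.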
Page 19
Problems with “Classic” VLIW
§ Object-code compatibility
– have to recompile all code for every machine, even for two machines in same generation
§ Object code size
– instruction padding wastes instruction memory/cache
– loop unrolling/software pipelining replicates code
§ Scheduling variable latency memory operations
– caches and/or memory bank conflicts impose statically unpredictable variability
§ Knowing branch probabilities
– Profiling requires a significant extra step in build process
§ Scheduling for statically unpredictable branches
– optimal schedule varies with branch path
Page 20
VLIW Instruction Encoding
[Figure: instruction stream divided into Group 1, Group 2, Group 3]
§ Schemes to reduce effect of unused fields
– Compressed format in memory, expand on I-cache refill
• used in Multiflow Trace
• introduces instruction addressing challenge
– Mark parallel groups
• used in TMS320C6x DSPs, Intel IA-64
– Provide a single-op VLIW instruction
• Cydra-5 UniOp instructions
Page 21
Intel Itanium, EPIC IA-64
§ EPIC is the style of architecture (cf. CISC, RISC)
– Explicitly Parallel Instruction Computing (really just VLIW)
§ IA-64 is Intel’s chosen ISA (cf. x86, MIPS)
– IA-64 = Intel Architecture 64-bit
– An object-code-compatible VLIW
§ Merced was first Itanium implementation (cf. 8086)
– First customer shipment expected 1997 (actually 2001)
– McKinley, second implementation shipped in 2002
– Recent version, Poulson, eight cores, 32nm, announced 2011
Page 22
Eight Core Itanium “Poulson” [Intel 2011]
§ 8 cores
§ 1-cycle 16KB L1 I&D caches
§ 9-cycle 512KB L2 I-cache
§ 8-cycle 256KB L2 D-cache
§ 32 MB shared L3 cache
§ 544 mm^2 in 32nm CMOS
§ Over 3 billion transistors
§ Cores are 2-way multithreaded
§ 6 instruction/cycle fetch
– Two 128-bit bundles
§ Up to 12 insts/cycle execute
Page 23
IA-64 Instruction Format
128-bit instruction bundle: Instruction 2 | Instruction 1 | Instruction 0 | Template
§ Template bits describe grouping of these instructions with others in adjacent bundles
§ Each group contains instructions that can execute in parallel
[Figure: bundles j-1, j, j+1, j+2 shown against groups i-1, i, i+1, i+2; group boundaries do not have to line up with bundle boundaries]
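The bundle layout can be sketched with plain bit manipulation. The field widths used here (a 5-bit template in the low bits and three 41-bit instruction slots) come from the IA-64 architecture rather than from the slide, and the function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41                       # 5 + 3 * 41 = 128 bits
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle into (template, [instruction 0, 1, 2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots
```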
Page 24
IA-64 Registers
§ 128 General Purpose 64-bit Integer Registers
§ 128 General Purpose 64/80-bit Floating Point Registers
§ 64 1-bit Predicate Registers
§ GPRs “rotate” to reduce code size for software pipelined loops
– Rotation is a simple form of register renaming allowing one instruction to address different physical registers on each iteration
Page 25
CS252
Rotating Register Files
Problems: Scheduled loops require lots of registers, lots of duplicated code in prolog, epilog
Solution: Allocate new set of registers for each loop iteration
Page 26
CS252
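Rotating Register File
[Figure: physical registers P0 through P7; with RRB=3, logical register specifier R1 is added to RRB to select a physical register]
Rotating Register Base (RRB) register points to base of current register set. Value added on to logical register specifier to give physical register number. Usually, split into rotating and non-rotating registers.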
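The mapping described above is a single addition. In the sketch below the wraparound with a modulo is an assumption for a register file of fixed size (the slide only says the value is added), and the function name and default size of 8 are illustrative.

```python
def physical_register(logical: int, rrb: int, num_rotating: int = 8) -> int:
    """Physical register number = logical specifier + RRB, wrapping around."""
    return (logical + rrb) % num_rotating
```

With RRB=3, logical register 1 maps to physical register P4, and logical register 6 wraps around to P1.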
Page 27
CS252
Rotating Register File (Previous Loop Example)
ld f1, ()   fadd f5, f4, ...   sd f9, ()   bloop

Three cycle load latency encoded as difference of 3 in register specifier number (f4 - f1 = 3)
Four cycle fadd latency encoded as difference of 4 in register specifier number (f9 - f5 = 4)

ld P9, ()   fadd P13, P12,   sd P17, ()   bloop   RRB=8
ld P8, ()   fadd P12, P11,   sd P16, ()   bloop   RRB=7
ld P7, ()   fadd P11, P10,   sd P15, ()   bloop   RRB=6
ld P6, ()   fadd P10, P9,    sd P14, ()   bloop   RRB=5
ld P5, ()   fadd P9, P8,     sd P13, ()   bloop   RRB=4
ld P4, ()   fadd P8, P7,     sd P12, ()   bloop   RRB=3
ld P3, ()   fadd P7, P6,     sd P11, ()   bloop   RRB=2
ld P2, ()   fadd P6, P5,     sd P10, ()   bloop   RRB=1
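The table above can be checked with a short simulation of the one-instruction loop. This sketch assumes RRB decrements by one on each bloop (as the RRB=8 down to RRB=1 rows suggest), that all reads in an instruction happen before its writes, and that the register file wraps around. The function names, the 32-register file size, and the use of None for "no value yet" are illustrative choices.

```python
def phys_index(logical, rrb, num_phys):
    """Logical specifier plus RRB, wrapping around the physical file."""
    return (logical + rrb) % num_phys


def run_rotating_loop(a, c, num_phys=32, start_rrb=8):
    """Execute  ld f1 | fadd f5, f4 | sd f9 | bloop  once per iteration."""
    phys = [None] * num_phys
    stored = []
    rrb = start_rrb
    for it in range(len(a) + 7):          # 7 extra passes drain the pipeline
        # Read all sources first: one VLIW instruction reads before it writes.
        to_store = phys[phys_index(9, rrb, num_phys)]
        add_src = phys[phys_index(4, rrb, num_phys)]
        if to_store is not None:
            stored.append(to_store)       # sd f9
        phys[phys_index(5, rrb, num_phys)] = (
            None if add_src is None else add_src + c)        # fadd f5, f4
        phys[phys_index(1, rrb, num_phys)] = (
            a[it] if it < len(a) else None)                  # ld f1
        rrb = (rrb - 1) % num_phys        # bloop rotates the register base
    return stored
```

A value loaded through f1 is read as f4 three iterations later, and its sum written through f5 is read as f9 four iterations after that, so the stored results come out in order without any prolog or epilog code.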
Page 28
IA-64 Predicated Execution
Problem: Mispredicted branches limit ILP
Solution: Eliminate hard to predict branches with predicated execution
– Almost all IA-64 instructions can be executed conditionally under predicate
– Instruction becomes NOP if predicate register false

Four basic blocks:
b0: Inst 1                (if)
    Inst 2
    br a==b, b2
b1: Inst 3                (else)
    Inst 4
    br b3
b2: Inst 5                (then)
    Inst 6
b3: Inst 7
    Inst 8

Predication => One basic block:
    Inst 1
    Inst 2
    p1,p2 <- cmp(a==b)
    (p1) Inst 3 || (p2) Inst 5
    (p1) Inst 4 || (p2) Inst 6
    Inst 7
    Inst 8

Mahlke et al, ISCA95: On average >50% branches removed
Warning: Complicates bypassing!
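If-conversion can be illustrated with a tiny interpreter in which every instruction carries an optional predicate and turns into a NOP when that predicate is false. The instruction format, register names, and the arithmetic in each block are made up for this sketch, and the polarity of p1 and p2 is chosen here for illustration.

```python
def branchy(a, b, x):
    """Reference version with a real branch."""
    x = x + 1
    if a == b:
        x = x * 2
    else:
        x = x + 10
    return x - 3


def run_predicated(program, regs):
    """Run straight-line code; an instruction is a NOP if its predicate is false."""
    regs = dict(regs)
    for pred, dest, fn in program:
        if pred is None or regs[pred]:
            regs[dest] = fn(regs)
    return regs


# (predicate register or None, destination, operation)
PROGRAM = [
    (None, "x",  lambda r: r["x"] + 1),
    (None, "p1", lambda r: r["a"] == r["b"]),   # compare sets a predicate ...
    (None, "p2", lambda r: r["a"] != r["b"]),   # ... and its complement
    ("p1", "x",  lambda r: r["x"] * 2),         # "then" side
    ("p2", "x",  lambda r: r["x"] + 10),        # "else" side
    (None, "x",  lambda r: r["x"] - 3),
]


def predicated(a, b, x):
    return run_predicated(PROGRAM, {"a": a, "b": b, "x": x})["x"]
```

Both sides of the original branch are fetched and issued in one basic block, and exactly one side takes effect.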
Page 29
CS252
IA-64 Speculative Execution
Problem: Branches restrict compiler code motion
    Inst 1
    Inst 2
    br a==b, b2

    Load r1
    Use r1
    Inst 3
Can’t move load above branch because might cause spurious exception

Solution: Speculative operations that don’t cause exceptions
    Load.s r1
    Inst 1
    Inst 2
    br a==b, b2

    Chk.s r1
    Use r1
    Inst 3
Speculative load never causes exception, but sets “poison” bit on destination register
Check for exception in original home block jumps to fixup code if exception detected
Particularly useful for scheduling long latency loads early
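The poison-bit idea can be modeled in a few lines. In this sketch a dictionary stands in for memory, a missing key stands in for a faulting address, and a sentinel object stands in for the poison bit. The function names mirror Load.s and Chk.s but are otherwise hypothetical, and the fixup code here simply repeats the load non-speculatively.

```python
POISON = object()   # stands in for the per-register "poison" bit


def load_s(memory, addr):
    """Speculative load: a faulting address yields POISON instead of raising."""
    return memory.get(addr, POISON)


def chk_s(value, fixup):
    """Check in the load's home block; run fixup code only if poisoned."""
    return fixup() if value is POISON else value


def run(memory, addr, a, b):
    r1 = load_s(memory, addr)              # load hoisted above the branch
    if a == b:
        return "b2"                        # branch taken: r1 is never checked
    r1 = chk_s(r1, lambda: memory[addr])   # fixup repeats the load, may fault
    return r1 + 1                          # Use r1
```

When the branch is taken, a bad address causes no exception at all. When it is not taken, the exception is raised at the check, in the block where the load originally lived.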
Page 30
CS252
IA-64 Data Speculation
Problem: Possible memory hazards limit code scheduling
    Inst 1
    Inst 2
    Store

    Load r1
    Use r1
    Inst 3
Can’t move load above store because store might be to same address

Solution: Hardware to check pointer hazards
    Load.a r1
    Inst 1
    Inst 2
    Store

    Load.c
    Use r1
    Inst 3
Data speculative load adds address to address check table
Store invalidates any matching loads in address check table
Check if load invalid (or missing), jump to fixup code if so
Requires associative hardware in address check table
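The address check table can be modeled as a small map from destination register to speculatively loaded address. This is a software sketch of the behavior described above, not the hardware organization. The class and method names are hypothetical, a dictionary stands in for memory, and the fixup code is modeled as simply redoing the load.

```python
class AddressCheckTable:
    """Tracks advanced loads so that later stores can invalidate them."""

    def __init__(self):
        self._entries = {}   # destination register -> loaded address

    def load_a(self, memory, reg, addr):
        """Data speculative load: record the address, then load."""
        self._entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """Store: write memory and invalidate any matching table entries."""
        memory[addr] = value
        self._entries = {r: a for r, a in self._entries.items() if a != addr}

    def load_c(self, memory, reg, addr, speculative_value):
        """Check: keep the early value if the entry survived, else reload."""
        if self._entries.pop(reg, None) == addr:
            return speculative_value
        return memory[addr]   # entry invalid or missing: redo the load
```

If the intervening store goes to a different address, the early value is kept. If it hits the same address, the check finds no entry and the load is repeated.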
Page 31
Limits of Static Scheduling
§ Unpredictable branches
§ Variable memory latency (unpredictable cache misses)
§ Code size explosion
§ Compiler complexity
§ Despite several attempts, VLIW has failed in general-purpose computing arena (so far).
– More complex VLIW architectures are close to in-order superscalar in complexity, no real advantage on large complex apps.
§ Successful in embedded DSP market
– Simpler VLIWs with more constrained environment, friendlier code.
Page 32
Intel Kills Itanium
§ Donald Knuth: “… Itanium approach that was supposed to be so terrific – until it turned out that the wished-for compilers were basically impossible to write.”
§ “Intel officially announced the end of life and product discontinuance of the Itanium CPU family on January 30th, 2019”, Wikipedia
Page 33
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
– Arvind (MIT)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)