Page 1
Sustainable Silicon: Energy-Efficient VLSI Interconnect Research
Patrick Chiang, Assistant Professor (and Students), Oregon State University VLSI Research Group
[email protected]
DOE, Aug 2011
http://eecs.oregonstate.edu/research/vlsi
Collaborators: Department of Energy, Intel, Boeing, LSI, SRC, Air Force Research Laboratory, HP Labs, National Science Foundation
Page 2
What does exascale mean?

IBM Roadrunner (12000 PowerX CPUs, 6000 AMD Opterons) compared with a 2018 exascale system

                                   2009 (Roadrunner)          2018
Power                              2.35 MW                    10-20 MW
Speed (petaflop = 10^15 FLOP/s)    1.04                       1000
Space                              296 racks, 6000 sq. ft.    ~2x
Memory                             103 Terabytes              ~1000x
Cost                               $125M                      ?
IBM: Roadrunner, Los Alamos (2009)
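The gap between the two columns is easier to see as energy per floating-point operation. The following is a minimal sketch that only divides the power figures by the speed figures quoted on this slide; the variable and function names are illustrative and not part of the original deck.

```python
# Energy per FLOP implied by the power and speed figures on this slide.
ROADRUNNER_POWER_W = 2.35e6       # 2.35 MW (2009)
ROADRUNNER_FLOPS = 1.04e15        # 1.04 petaflop/s
EXA_FLOPS = 1000e15               # 1000 petaflop/s (2018 column)
EXA_POWER_RANGE_W = (10e6, 20e6)  # 10-20 MW (2018 column)


def picojoules_per_flop(power_w, flops_per_s):
    """Average energy spent per floating-point operation, in pJ."""
    return power_w / flops_per_s * 1e12


roadrunner_pj = picojoules_per_flop(ROADRUNNER_POWER_W, ROADRUNNER_FLOPS)
exa_pj = [picojoules_per_flop(p, EXA_FLOPS) for p in EXA_POWER_RANGE_W]

print(f"Roadrunner: {roadrunner_pj:.0f} pJ/FLOP")
print(f"2018 column: {exa_pj[0]:.0f}-{exa_pj[1]:.0f} pJ/FLOP")
print(f"Required improvement: {roadrunner_pj / exa_pj[1]:.0f}x to "
      f"{roadrunner_pj / exa_pj[0]:.0f}x")
```

Under these figures the whole machine, including memory and interconnect, would have to run on roughly 10-20 pJ per FLOP, about two orders of magnitude below Roadrunner.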
Page 3
Not just super-computers
SENSOR: 10^9 OPS, < 1 mW
EXASCALE: 10^18 FLOPS, ~1 MW
EMBEDDED: 10^12 FLOPS, ~1 W
Source: DARPA Exascale Study, 2008
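Dividing each power budget by its throughput shows what the three classes have in common. This is a small sketch using only the numbers on this slide; the dictionary and function names are illustrative, and the sensor entry is an upper bound because its budget is quoted as "< 1 mW".

```python
# (operations per second, power budget in watts) for each class on the slide.
CLASSES = {
    "sensor": (1e9, 1e-3),     # 10^9 OPS at < 1 mW (upper bound)
    "embedded": (1e12, 1.0),   # 10^12 FLOPS at ~1 W
    "exascale": (1e18, 1e6),   # 10^18 FLOPS at ~1 MW
}


def picojoules_per_op(ops_per_s, power_w):
    """Energy budget per operation, in pJ."""
    return power_w / ops_per_s * 1e12


budgets = {name: picojoules_per_op(*spec) for name, spec in CLASSES.items()}
for name, pj in budgets.items():
    print(f"{name}: about {pj:.1f} pJ per operation")
```

All three targets work out to roughly 1 pJ per operation, which is why the same low-energy circuit techniques apply from sensors up to exascale machines.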
Page 4
4 Major Challenges
1) Memory and Storage
  – Capacity; Latency
2) Concurrency and Locality
  – fCLK = 1 GHz, Parallelism
  – Software/hardware
3) Resiliency Challenge
  – Low VDD
  – Process Variability
4) Energy and Power
  – Interconnect
(Figure label: Random Inputs)
Page 5
Interconnect Energy Wall
• “The Energy and Power Challenge is the most pervasive of the four, and has its roots in the inability of the group to project any combination of currently mature technologies that will deliver sufficiently powerful systems in any class at the desired power levels.”
• “A key observation of the study is that it may be easier to solve the power problem associated with base computation than it will be to reduce the problem of transporting data from one site to another - on the same chip, between closely coupled chips in a common package, or between different racks on opposite sides of a large machine room…”
Source: DARPA Exascale Study, Sep. 2008
• Microprocessors: “at 0.13 um approximately 51% of microprocessor power was consumed by interconnect, with a projection that without changes in design philosophy, in the next five years up to 80% of microprocessor power will be consumed by interconnect”, ITRS-07
Page 6
Power Density
Source: ITRS 2010
Page 7
Not enough power available
Source: Bill Dally, 2011
Page 9
Interconnect Energies
3.5% capacitance improvement / year
Gap widening > 100x
No solution < 1pJ/bit
Page 10
Alternatives
• 3D Stacking
• Silicon Photonics
Source: G. Loh, 2008; Intel 2010
1. Off-chip bandwidth:
  – 3D: dense through-silicon vias
  – Photonics: wavelength division multiplexing
2. Off-chip energy:
  – 3D: make things closer
  – Photonics: use low-loss light
Electrical transceivers
Page 11
Problem 1: On-Chip Wires
• On-chip wire power does not scale
  – Dominated by interconnect capacitance (C·VDD^2)
ON-CHIP (Status Quo): 100 - 200fJ/bit/mm
[DOE, Exascale Workshop]
OUR GOAL: < 5fJ/bit/mm
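The C·VDD^2 relation can be turned into a quick estimate of how far the signal voltage has to drop to reach the goal. This is a rough sketch, not the group's model: the wire capacitance of 0.2 pF/mm is an assumed, illustrative value that does not appear in the slides, and activity factor and transceiver overhead are ignored.

```python
import math

# ASSUMPTION: illustrative wire capacitance, not a figure from the slides.
WIRE_CAP_F_PER_MM = 0.2e-12


def wire_energy_fj_per_bit_mm(vdd, cap_f_per_mm=WIRE_CAP_F_PER_MM):
    """Energy drawn to charge 1 mm of wire through a full swing, E = C * VDD^2, in fJ."""
    return cap_f_per_mm * vdd ** 2 * 1e15


def voltage_for_target(target_fj_per_bit_mm, cap_f_per_mm=WIRE_CAP_F_PER_MM):
    """Voltage at which C * V^2 equals the target energy per bit per mm."""
    return math.sqrt(target_fj_per_bit_mm * 1e-15 / cap_f_per_mm)


full_swing_fj = wire_energy_fj_per_bit_mm(1.0)
goal_voltage = voltage_for_target(5.0)

print(f"Full swing at 1.0 V: {full_swing_fj:.0f} fJ/bit/mm")
print(f"Voltage needed for 5 fJ/bit/mm: {goal_voltage * 1e3:.0f} mV")
```

With the assumed capacitance, a 1.0 V full swing lands at the upper end of the 100-200 fJ/bit/mm status quo, and the 5 fJ/bit/mm goal implies swings well below a conventional supply, which is the motivation for low-swing signaling.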
Page 12
Problem 2: Off-chip I/O
Source: Exascale Roadmap Meeting, Dec. 2009
OFF-CHIP: 1-10pJ/bit (1-10mW / Gbps)
CONVENTIONAL
OUR GOAL: < 0.1pJ/bit
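The two units on this slide are interchangeable: 1 mW/Gbps is 10^-3 J/s divided by 10^9 bit/s, which is 1 pJ/bit. A minimal sketch of the conversion follows; the function names and the 10 Gbps example rate are illustrative.

```python
def mw_per_gbps_to_pj_per_bit(mw_per_gbps):
    """Convert link efficiency from mW/Gbps to pJ/bit (numerically identical)."""
    watts_per_bit_per_s = (mw_per_gbps * 1e-3) / 1e9  # J/bit
    return watts_per_bit_per_s * 1e12


def link_power_mw(pj_per_bit, data_rate_gbps):
    """Power of one link in mW for a given efficiency and data rate."""
    return pj_per_bit * data_rate_gbps


conventional_mw = link_power_mw(1.0, 10.0)  # low end of 1-10 pJ/bit at 10 Gbps
goal_mw = link_power_mw(0.1, 10.0)          # 0.1 pJ/bit goal at 10 Gbps

print(f"1 mW/Gbps = {mw_per_gbps_to_pj_per_bit(1.0):.1f} pJ/bit")
print(f"10 Gbps link: {conventional_mw:.0f} mW conventional, {goal_mw:.0f} mW at goal")
```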
Page 13
Low-Power Interconnects Overview
1. On-Chip I/O
  – Network-on-a-chip with reduced-swing interconnect
  – Fundamental limits to low-voltage swing
2. Off-Chip I/O
  – Sub-1 mW/Gbps off-chip I/O
Page 14
Near-Threshold Operation (VDD ~ 0.4V)
[S. Hanson, 2006] MIT, MICHIGAN, PURDUE, INTEL
Page 15
Synctium: near-threshold parallel processor
- Throughput of SIMD; energy-efficiency of near-threshold operation
- Eight parallel lanes, at near-threshold (VDD=0.5V)
- Variation Tolerance:
  - Razor-like detection/recovery per lane
  - Lane weaving
  - Decoupled instruction queues
E. Krimer, R. Pawlowski, M. Erez, P. Chiang, "Synctium: a Near-Threshold Stream Processor for Energy-Constrained Parallel Applications", IEEE Computer Architecture Letters, 2010.
Page 16
(1) Energy-Efficient, On-Chip Links
• Our Goal: low-power on-chip links
  – Analog low-voltage swing:
    – (3) XBARS
    – (4) LINK TRAVERSAL
• Router Power:
  – (1) Buffering: 30%
  – (2) Arbitration: 10%
  – (3) XBAR: 30%
  – (4) LINKS: 30%
Intel, 80 Cores, ISSCC 2007
Page 17
Token Flow Control NoC (Li-Shiuan Peh, MIT)
• Conventional Router:
  – Each hop requires 4 cycles
• Proposed TFC Router:
  – First hop requires 4 cycles
  – Following hops require 2 cycles
• Tokens for advance allocation
  – If little congestion, buffering is skipped
• NoC power dominated by XBAR and LT
  – TFC reduces buffer writes
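The per-hop cycle counts above translate into end-to-end latency as follows. This is a simple sketch that assumes an uncongested network, so every hop after the first takes the 2-cycle path; the function names and the 7-hop example are illustrative.

```python
def conventional_cycles(hops):
    """Conventional router: every hop takes 4 cycles."""
    return 4 * hops


def tfc_cycles(hops):
    """Token flow control: 4 cycles for the first hop, 2 for each following hop.

    ASSUMPTION: no congestion, so buffering is skipped on every later hop.
    """
    if hops == 0:
        return 0
    return 4 + 2 * (hops - 1)


for hops in (1, 4, 7):
    print(f"{hops} hops: conventional {conventional_cycles(hops)} cycles, "
          f"TFC {tfc_cycles(hops)} cycles")
```

For a 7-hop route this gives 28 cycles conventionally and 16 cycles with TFC; under congestion the TFC figure moves back toward the conventional one.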
Page 18
(Figure: 8x8 grid with rows and columns indexed 0-7, cells marked R or G; labels: XBAR, LINKS)
Page 19
Low-Swing, Bitcell-Based Crossbar
(Plot: Diff. Mode Crosstalk (mV), 0-140, vs. Aggressor Distance from Center of Diff. Pair (um), 0-1.25; curves: Shielded, Unshielded)
Page 20
Measurement Summary
ICCD, 2010
Page 21
Next Goal: 1-5 fJ/mm

                    Conventional   Schinkel     Stojanovic    Our TFC     New Goal
                    Full Swing     (JSSC '09)   (ISSCC '09)   Router
Wire Length         1mm            2mm          10mm          1mm         1mm-5mm
Supply              1.2V           1.2V         -             1.2V        0.5-1.0V
Transceiver Area    21um^2         TX: 20um     2880um^2      23um^2      20-30um^2
Signal Swing        1.2V           120mV        200mV         250mV       50mV
Energy/Bit          305fJ          105fJ        356fJ         28-60fJ     1-5fJ/mm

• Determine: Fundamental limits to energy-efficient, on-chip link
• GOAL: 5-50x improvement in on-chip link energy
  – Energy scalability
  – Low area
  – Robust
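Because the designs in the table drive different wire lengths, it can help to normalize Energy/Bit by length before comparing them with the per-mm goal. The sketch below does only that division, using the table's numbers; treating energy as linear in wire length is a simplifying assumption, and the dictionary keys are shorthand labels rather than names from the slides.

```python
# (energy per bit in fJ, wire length in mm), taken from the table above.
LINKS = {
    "conventional_full_swing": (305.0, 1.0),
    "schinkel_2009": (105.0, 2.0),
    "stojanovic_2009": (356.0, 10.0),
}
TFC_ROUTER_FJ_PER_MM = (28.0, 60.0)  # 28-60 fJ over 1 mm
GOAL_FJ_PER_MM = (1.0, 5.0)          # 1-5 fJ/mm


def fj_per_bit_mm(energy_fj, length_mm):
    """Energy per bit normalized by wire length (assumes linear scaling)."""
    return energy_fj / length_mm


per_mm = {name: fj_per_bit_mm(*spec) for name, spec in LINKS.items()}
for name, value in per_mm.items():
    print(f"{name}: {value:.1f} fJ/bit/mm")
print(f"TFC router: {TFC_ROUTER_FJ_PER_MM[0]:.0f}-{TFC_ROUTER_FJ_PER_MM[1]:.0f} fJ/bit/mm")
print(f"Goal: {GOAL_FJ_PER_MM[0]:.0f}-{GOAL_FJ_PER_MM[1]:.0f} fJ/bit/mm")
```

On this normalized basis the listed designs sit between roughly 28 and 305 fJ/bit/mm, against a goal of 1-5 fJ/bit/mm.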
Page 22
Energy-Scalable On-Chip TxRx
Channel: 1-4mm
Digital Offset Cancellation Decision Feedback Equalization
Page 24
On-Chip Links – Measurement Results
Page 25
Inverter Cross-Over Point
Page 26
Off-Chip I/O Scaling Trends
• I/O power efficiency is a function of:
  • data rate
  • process technology
  • channel loss, equalization complexity
• Project goal: < 1 mW/Gbps over a scalable 5-10 Gbps data rate
Source: S. Palermo, Texas A&M
Page 27
(2) Off-Chip Links: Global Clock Distribution Optimization (with Intel)
• Clock dominates power in links
• Is there a way to share clock power?
Clk/CDR is 54% of Total Power (10 Gb/s)
Clk/CDR is 71% of Total Power (15 Gb/s)
Intel (VLSI, 2007)
Page 28
Conventional Clock Generation and Distribution
1) CLOCK DISTRIBUTION: 0.5W in the global distribution alone
2) DESKEW: Significant energy in multi-phase generation
Page 29
Chip #1: Injection-Locked Receiver Architecture
Page 30
ILRO: Extension of Adler's Equation
Adler's doesn't apply to ring oscillators!
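For context, the classical form of Adler's equation for an LC oscillator under weak injection is commonly written as below. The notation is the usual textbook one and is not taken from the slide: \theta is the phase difference between the oscillator and the injected signal, \omega_0 the free-running frequency, \omega_{inj} the injection frequency, Q the tank quality factor, and I_{inj}/I_{osc} the injection strength.

```latex
\frac{d\theta}{dt} = \omega_0 - \omega_{inj}
  - \frac{\omega_0}{2Q}\,\frac{I_{inj}}{I_{osc}}\,\sin\theta,
\qquad
\omega_L \approx \frac{\omega_0}{2Q}\,\frac{I_{inj}}{I_{osc}}
```

Here \omega_L is the one-sided lock range. The expression depends on the Q of a resonant tank, which a ring oscillator does not have, so this form is not expected to carry over directly; the extension the slide refers to is not reproduced here.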
Page 31
Chip #2: Near-Threshold, 0.15 mW/Gbps, 8 Gbps Serial Link Receiver
  – Sub-harmonic Injection-Locking
  – Operates at VDD ~ 0.6V
Page 33
Measured Deskew Range and BER
Page 34
Comparison Table
accepted, CICC-2011
Page 35
Take-Away Points
• Lower energy silicon is possible
  – Aggressive interconnect circuits show:
    • Off-Chip: 5x improvements
    • On-Chip: 50x improvements
• Reliability at low VDD is an issue
  – Explore in-situ adaptation to self-heal autonomously
• Magic bullets do NOT exist
  – Lower energy --> lower performance
  – Dynamically adapt the entire system
  – Requires co-design interaction between software, architecture, and underlying silicon
Page 36
Outreach
• Publications
  – K. Hu, T. Jiang, S. Palermo, and P. Chiang, "Low-Power 8Gb/s Near-Threshold Serial Link Receivers Using Super-Harmonic Injection Locking in 65nm CMOS", accepted, IEEE Custom Integrated Circuits Conference, 2011.
  – R. Bai, J. Wang, L. Xia, F. Zhang, Z. Yang, W. Hu, P. Chiang, "Sinusoidal Clock Sampling for Multi-Gigahertz ADCs", accepted, IEEE Transactions on Circuits and Systems-I, 2011.
  – Lingli Xia, Jingguang Wang, Will Beattie, Jacob Postman, and Patrick Yin Chiang, "Sub-2ps, Static Phase Error Calibration Technique Incorporating Measurement Uncertainty Cancellation for Multi-Gigahertz Time-Interleaved T/H Circuits", accepted, IEEE Transactions on Circuits and Systems-I, 2011.
  – K. Hu, L. Wu, and P. Chiang, "A Comparative Study of 20-Gb/s NRZ and Duobinary Signaling Using Statistical Analysis", accepted, IEEE Transactions on VLSI Systems, May 2011.
  – T. Jiang, W. Liu, C. Zhong, F. Zhong, P. Chiang, "Single-Channel, 1.25-GS/s, 6-bit, Loop-Unrolled Asynchronous SAR-ADC in 40nm-CMOS", IEEE Custom Integrated Circuits Conference, Sep. 2010.
  – J. Postman and P. Chiang, "Energy-Efficient Transceiver Circuits for Short-Range On-chip Interconnects", accepted, IEEE Custom Integrated Circuits Conference, 2011.
  – B. Goska, J. Postman, M. Erez, P. Chiang, "Hardware/software co-design for energy-efficient parallel computing," accepted, Department of Energy SciDAC Conference, July 2011.
  – Joseph Crop, Robert Pawlowski, Nariman Moezzi-Madani, Jarrod Jackson and Patrick Chiang, "Design Automation Methodology for Improving the Variability of Synthesized Digital Circuits Operating in the Sub/Near-Threshold Regime," accepted, Workshop on Low-Power System on Chip (SoC), 2nd Green Computing Conference, 2011.
  – T. Krishna, J. Postman, L.-S. Peh, P. Chiang, "SWIFT: A SWing-reduced Interconnect For a Token-based Network-on-Chip in 90nm CMOS", International Conference on Computer Design (ICCD), Amsterdam, Netherlands, October 2010.
  – E. Krimer, R. Pawlowski, M. Erez, P. Chiang, "Synctium: a Near-Threshold Stream Processor for Energy-Constrained Parallel Applications", IEEE Computer Architecture Letters, Jan/June 2010.
• Talks
  – P. Chiang, "Hardware/software co-design for energy-efficient parallel computing," accepted, Department of Energy SciDAC Conference, July 2011.
  – P. Chiang, Carnegie-Mellon, May 2011.
  – P. Chiang, Princeton, May 2011.
  – P. Chiang, UC-Davis, Feb 2011.
  – P. Chiang, Stanford, Feb 2011.
  – P. Chiang, Intel, Mar 2011.
  – P. Chiang, UC-San Diego, Jan 2011.
  – P. Chiang, Broadcom, Dec 2010.
  – P. Chiang, Illinois, Oct 2010.
  – P. Chiang, Michigan, Oct 2010.
  – P. Chiang, USC, Oct 2010.