A Low-Overhead Asynchronous Interconnection Network for GALS Chip Multiprocessors

Michael N. Horak, University of Maryland
Steven M. Nowick, Columbia University
Matthew Carlberg, UC Berkeley
Uzi Vishkin, University of Maryland

In ACM/IEEE Int. Symp. on Networks-on-Chip (NOCS-10)

Challenges for Designing Networks-on-Chip

• Power Consumption
  – Will exceed future power budgets by a factor of 10x [1]
  – Global clocks: consume a large fraction of overall power
• Performance Bottlenecks
  – Large network latencies cause performance degradation
• Increased Designer Resources
  – Many techniques are incompatible with current CAD tools
  – Difficulties integrating heterogeneous modules
    • Chips partitioned into multiple timing domains

[1] J. D. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler, and L.-S. Peh. Research challenges for on-chip interconnection networks. IEEE Micro, 27(5):96-108, 2007.

Potential Advantages of Asynchronous Design

• Lower Power
  – No clock power consumed: without clock gating
  – Idle components inherently consume low power
• Greater Flexibility/Modularity
  – No clock distribution
  – Easier integration between multiple timing domains
  – Supports reusable components
• Lower System Latency
  – End-to-end traffic without clock synchronization
• More Resilient to On-Chip Variations
  – Correct operation depends on localized timing constraints

Mixed-Timing (GALS) System

• Globally Asynchronous, Locally Synchronous [2]
• Asynchronous Network
  – Clockless network fabric
• Synchronous Terminals
  – Different unrelated clocks
• Mixed-Timing Interfaces
  – Provide robust communication between Sync and Async domains

[2] D. Chapiro. Globally-Asynchronous Locally-Synchronous Systems. PhD thesis, Stanford Univ., 1984.

Advances in GALS Networks-on-Chip

• Commercial Designs
  – Silistix, Inc. (J. Bainbridge, S. Furber. IEEE Micro-02)
    • CHAIN™works tool suite: heterogeneous SOCs
  – Fulcrum Microsystems (A. Lines. Micro-04)
    • FocalPoint chips: high-performance Ethernet routing
• Recent Work
  – Asynchronous Network-on-Chip (ANoC) (Beigne, Clermidy, Vivet et al. Async-05)
    • Wormhole packet-switched NoC with low-latency service
  – MANGO Clockless Network-on-Chip (T. Bjerregaard. DATE-05)
    • Offers quality-of-service (QoS) guarantees
  – RasP On-Chip Network (S. Hollis, S. W. Moore. ICCD-06)
    • Utilizes high-speed pulse-based signaling
  – SpiNNaker Project (Khan, Lester, Plana, Furber et al. IJCNN-08)
    • Massively-parallel neural simulation

GALS NOCs: Typical Current Targets

• Low- to Moderate-Performance Embedded Systems
  – 200-500 MHz
  – High system latency
• “Four-Phase Return-to-Zero” Protocols
  – Two round-trips/link per transaction
• “Delay-Insensitive Data” Encoding (dual-rail, 1-of-4)
  – Lower coding efficiency than single-rail
• Complex-Functionality Router Nodes
  – 5-port routers with layered services (QoS, etc.)
  – High latency/high area
• Custom Circuit Techniques:
  – Pulse-based signaling, low-swing signaling
  – Dynamic logic, specialized cells

Outline

• Introduction
• Target GALS Network Design
• Background: XMT Processor / MoT Network
• Asynchronous Network Primitives
• Experimental Results
• Conclusions

Target GALS Network Design

• Shared-Memory Chip Multiprocessors
  – Medium- to High-Performance

• “Heterochronous” Timing [3]
  – Most general GALS timing model
  – Supports multiple synchronous domains with unrelated clocking
  – Promotes reuse of Intellectual Property (IP) modules

[3] D. Messerschmitt, “Synchronization in Digital System Design”, IEEE Journal on Selected Areas in Communications, October 1990

• Transition Signaling (Two-Phase)
  – Most existing GALS NOCs use “four-phase handshaking”
    • 2 round-trip link communications per transaction
  – Benefits of Two-Phase:
    • 1 round-trip link communication per transaction
    • improved throughput, power, ...
  – Challenge of Two-Phase: designing lightweight implementations
    • Most existing 2-phase designs use:
      – complex slow registers: double latch, double-edge-triggered, capture/pass
        » [Seitz/Su “Mosaic” 93, Brunvand 91, Sutherland 89]
      – custom circuit components
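The round-trip counts quoted for four-phase and two-phase handshaking can be checked with a small event-level sketch. It models only the request and acknowledge wires of one link; the function names are illustrative and not taken from the slides, and no circuit timing is modeled.

```python
def four_phase_events():
    """Four-phase return-to-zero: every transfer raises and then resets both wires."""
    return [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]


def two_phase_events(req, ack):
    """Two-phase transition signaling: one toggle on req answered by one toggle on ack.

    The new wire levels are returned as well, because the wires do not return to zero.
    """
    req ^= 1
    ack ^= 1
    return [("req", req), ("ack", ack)], req, ack


def round_trips(events):
    """Count round trips: each req event answered by an ack event is one round trip."""
    return sum(1 for wire, _ in events if wire == "ack")


if __name__ == "__main__":
    print("four-phase round trips:", round_trips(four_phase_events()))
    req = ack = 0
    for flit in ("address", "data"):
        events, req, ack = two_phase_events(req, ack)
        print(flit, events, "round trips:", round_trips(events))
```

In the two-phase case the second transfer is signaled by falling edges, which is why two-phase storage elements have to react to both signal transitions.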

• Single-Rail Bundled Data
  – Most existing GALS NOCs use “delay-insensitive” link encodings
    • provide great timing-robustness ==> cost = poor coding efficiency
    • examples: dual-rail, 1-of-4
  – “Single-Rail Bundled Data” benefits:
    • re-use synchronous datapaths: 1 wire/bit + added “request”
    • excellent coding efficiency
  – Challenge: requires matched delay for “request” signal
    • 1-sided timing constraint: “request” must arrive after data stable
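A rough wire-count comparison makes the coding-efficiency argument concrete. The sketch below counts forward-going wires only (the acknowledge wire is left out of every scheme), and the one-sided bundling constraint is reduced to a single inequality with a hypothetical margin parameter; the delays used in the example are invented for illustration.

```python
from math import ceil


def wires_single_rail_bundled(bits):
    """One wire per data bit plus one bundled request wire."""
    return bits + 1


def wires_dual_rail(bits):
    """Two wires per data bit; validity is encoded in the data itself."""
    return 2 * bits


def wires_one_of_four(bits):
    """Four wires per pair of data bits."""
    return 4 * ceil(bits / 2)


def coding_efficiency(bits, wires):
    """Data bits carried per wire."""
    return bits / wires


def bundling_constraint_met(data_delay_ps, request_delay_ps, margin_ps=0.0):
    """One-sided constraint: the request must arrive after the data is stable."""
    return request_delay_ps >= data_delay_ps + margin_ps


if __name__ == "__main__":
    bits = 32
    for name, fn in [("single-rail bundled", wires_single_rail_bundled),
                     ("dual-rail", wires_dual_rail),
                     ("1-of-4", wires_one_of_four)]:
        wires = fn(bits)
        print(f"{name}: {wires} wires, efficiency {coding_efficiency(bits, wires):.2f}")
    print("constraint met:", bundling_constraint_met(300.0, 350.0, margin_ps=20.0))
```

Because the constraint is one-sided, a request path that is slower than necessary costs performance but not correctness.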

• High Performance
  – Low System-Level Latency
    • minimize end-to-end delay under light to moderate traffic
  – High Sustained Throughput
    • maximize steady-state throughput under heavy traffic

• Standard Cell Methodology
  – Use existing standard cell libraries
    • only exception: analog arbiter circuit
  – Challenge: timing analysis using existing tools

• Fine-Grained Network Topology
  – Lightweight network nodes
    • low-functionality, low-radix router components
    • avoids 5-port router with North/South/East/West/Local ports

Outline

• Introduction
• Target GALS Network Design
• Background: XMT Processor / MoT Network
  – eXplicit Multi-Threading (XMT) Architecture
  – Mesh-of-Trees (MoT) Network Topology
  – Synchronous Router Nodes
• Asynchronous Network Primitives
• Experimental Results
• Conclusions

XMT Parallel Architecture

• XMT = “eXplicit Multi-Threading” (1997-present) [4]
  – Led by Prof. Uzi Vishkin at University of Maryland, College Park
• Based on Parallel Random Access Model (PRAM)
  – Largest body of parallel algorithmic theory
• Ease of Programmability
  – XMT-C language + optimizing compiler
  – Single-Program Multiple-Data (SPMD) programming methodology
• Demonstrated to Provide Significant Speedups
  – Performs well on irregular computations (BFS, ray-tracing)
  – 100x speedup for VHDL circuit simulations compared to serial [5]

[4] D. Naishlos, J. Nuzman, C.-W. Tseng, and U. Vishkin. “Towards a first vertical prototyping of an extremely fine-grained parallel programming approach”, SPAA 2001

[5] P. Gu and U. Vishkin, “Case study of gate-level logic simulation on an extremely fine-grained chip multiprocessor”, Journal of Embedded Computing, April 2006

XMT Parallel Architecture

• Processing Clusters
  – Groups of simple pipelined cores, e.g. 16 Thread Control Units (TCU)
  – Each TCU executes to completion with little to no synchronization
  – “IOS” = independence-of-order semantics: no WAW/WAR/RAW data hazards between threads
• Distributed Caches
  – Shared global L1 data cache
  – No cache coherence problem
• NOC Challenge: high bandwidth / low power requirements
  – Many concurrent memory requests (load/store)
  – Short packets: 1-2 flits / dynamically-varying traffic
  – Low latency required for system performance

Proposed XMT Parallel Architecture: with GALS Interconnection Network

[Figure: XMT architecture diagram; visible label: GALS Network …]

Mesh-of-Trees Network Topology

• Variant of classic MoT
• N fan-out trees
  – Routing only
  – Root at source terminals
• N fan-in trees
  – Arbitration only
  – Root at destination terminals

[Figure: Mesh-of-Trees topology, with regions labeled “Arbitration” and “Routing”]

Mesh-of-Trees Network Topology

• High Throughput
  – Unique routing paths (source/sink)
  – Avoids interference penalties
• Fixed Path Length
  – Logarithmic depth
• Distributed Low-Radix Routing
  – Limited functionality nodes
  – Wormhole deterministic routing
• Shown to Perform Well for CMPs
  – Provides very high sustained throughput [6]
  – High saturation throughput: ~91%

[6] A. O. Balkan, G. Qu, U. Vishkin, “Mesh-of-Trees and alternative interconnection networks for single-chip parallelism”, IEEE Transactions on Very Large Scale Integration Systems, April 2009
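The unique-path and logarithmic-depth properties can be illustrated with a small path enumerator. This is a sketch under stated assumptions: binary trees, N a power of two, and a node-labeling scheme (tree kind, tree owner, level, address prefix) that is invented here and does not come from the slides. Fan-out nodes consume destination address bits most-significant first; fan-in nodes are traversed from leaf to root.

```python
def mot_path(n, src, dst):
    """Return (nodes, routing_bits) for the single src -> dst path in an n x n MoT.

    nodes: fan-out (routing) nodes from root to leaf, then fan-in (arbitration)
           nodes from leaf to root.
    routing_bits: the binary routing decision taken at each fan-out level.
    """
    depth = n.bit_length() - 1
    if n < 2 or (1 << depth) != n:
        raise ValueError("n must be a power of two, at least 2")
    if not (0 <= src < n and 0 <= dst < n):
        raise ValueError("terminal index out of range")

    nodes, routing_bits = [], []
    # Fan-out tree rooted at the source: one routing node per destination bit.
    for level in range(depth):
        prefix = dst >> (depth - level)  # destination bits consumed so far
        nodes.append(("fanout", src, level, prefix))
        routing_bits.append((dst >> (depth - 1 - level)) & 1)
    # Fan-in tree rooted at the destination: walk from the leaf up to the root.
    for level in range(depth - 1, -1, -1):
        prefix = src >> (depth - level)  # identifies the subtree holding src
        nodes.append(("fanin", dst, level, prefix))
    return nodes, routing_bits


if __name__ == "__main__":
    nodes, bits = mot_path(8, src=3, dst=5)
    print("routing bits:", bits)
    print("path length:", len(nodes))
    other, _ = mot_path(8, src=4, dst=6)
    print("shared nodes with 4 -> 6:", set(nodes) & set(other))
```

Every path visits log2(N) routing nodes and log2(N) arbitration nodes, and two paths with different sources and different destinations share no node, which is the "avoids interference" property listed above.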

Synchronous Routing Primitive

• Fan-Out Component [7]
  – 1 Input, 2 Outputs
  – Synchronous Flow Control
    • Back-pressure mechanism
    • Signal to previous stage when new data can be accepted
    • Based on “Latency-Insensitive Design” [Carloni et al., TCAD 01]
  – 2-Register FIFO: B0, B1
  – Allows 1 flit/cycle in steady-state
    • Accept new data and forward stored data concurrently
  – Cost: 1 extra auxiliary register (flip-flop-based)

[7] A. O. Balkan, G. Qu, U. Vishkin. “A Mesh-of-Trees Interconnection Network for Single-Chip Parallel Processing”, IEEE ASAP Symposium (2006)
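Why a second register buys 1 flit/cycle can be seen in a cycle-level toy model. The sketch assumes the back-pressure signal is registered, i.e. the stage decides whether it can accept a flit from its occupancy at the start of the cycle; that assumption, the always-valid source, and the always-ready sink are simplifications made here, not details from the slides.

```python
from collections import deque


def simulate(capacity, cycles):
    """Count flits delivered by one buffered stage over a number of clock cycles."""
    fifo = deque()
    delivered = 0
    next_flit = 0
    for _ in range(cycles):
        # Registered back-pressure: decided from the state at the start of the cycle.
        can_accept = len(fifo) < capacity
        if fifo:                      # forward a stored flit downstream
            fifo.popleft()
            delivered += 1
        if can_accept:                # accept a new flit in the same cycle
            fifo.append(next_flit)
            next_flit += 1
    return delivered


if __name__ == "__main__":
    for capacity in (1, 2):
        print(f"capacity {capacity}: {simulate(capacity, 10)} flits in 10 cycles")
```

With one register the stage alternates between accepting and forwarding; with two it accepts new data and forwards stored data concurrently in every cycle after the first.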

Synchronous Arbitration Primitive

• Fan-In Component [7]
  – 2 Inputs, 1 Output
  – Synchronous Flow Control
    • Back-pressure mechanism
    • Based on “Latency-Insensitive Design”
  – 2-Stage FIFOs at each input port
  – When empty, latency = 1 cycle
  – When stalled, latency = 2+ cycles
    • Depends on back-pressure and synchronous arbitration
  – Cost: total of 4 registers (flip-flop based)

Outline

• Introduction
• Target GALS Network Design
• Background: XMT Processor / MoT Network
• Asynchronous Network Primitives
  – Routing primitive (Fan-out)
  – Arbitration primitive (Fan-in)
  – Mixed-timing interfaces
• Experimental Results
• Conclusions

New Routing Primitive

[Figure: routing primitive interface. Input channel: Req, Ack, Data_In, B(oolean). Output channels: Req0/Ack0/Data0 and Req1/Ack1/Data1. Callouts: Handshaking Signals (Request / Acknowledge); Binary Routing Signal; Data Channels.]

Routing Primitive

[Figure: routing primitive circuit. Blocks: Latch Control 0 and Latch Control 1, Toggle 0 and Toggle 1, two LATCH registers. Signals: Req, Ack, B, Data_In; Req0, Ack0, Data0; Req1, Ack1, Data1. Callouts: Latch Controllers; Normally Opaque Latch Registers; Latency (Data and B signal arrive); Throughput.]
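The handshaking behavior of the routing primitive can be approximated with a purely behavioral model. This is a sketch, not the gate-level design: it assumes transition (two-phase) signaling on every wire, treats an output channel as busy while its request and acknowledge levels differ, and derives the upstream Ack as the XOR of Req0 and Req1. That last point is one plausible reading of the figure labels and is an assumption here. The class and method names are invented for illustration.

```python
class RoutingPrimitive:
    """Behavioral model of a 1-input, 2-output routing node with two-phase handshaking."""

    def __init__(self):
        self.req_out = [0, 0]            # Req0, Req1 wire levels
        self.ack_out = [0, 0]            # last Ack0, Ack1 levels seen from downstream
        self.data_out = [None, None]     # contents of the two output latch registers

    @property
    def ack(self):
        """Upstream acknowledge (assumed): toggles whenever either request toggles."""
        return self.req_out[0] ^ self.req_out[1]

    def busy(self, port):
        """An output channel is busy until its request has been acknowledged."""
        return self.req_out[port] != self.ack_out[port]

    def send(self, data, b):
        """Route one flit to output b (0 or 1); return False if that channel is busy."""
        if self.busy(b):
            return False
        self.data_out[b] = data          # latch opens, captures the data, closes again
        self.req_out[b] ^= 1             # a single transition signals the new flit
        return True

    def acknowledge(self, port):
        """Downstream stage acknowledges by toggling its Ack wire to match Req."""
        self.ack_out[port] = self.req_out[port]


if __name__ == "__main__":
    node = RoutingPrimitive()
    print(node.send("flit A", 0), node.req_out, node.ack)
    print(node.send("flit B", 1), node.req_out, node.ack)
    print(node.send("flit C", 0), "(channel 0 still busy)")
    node.acknowledge(0)
    print(node.send("flit C", 0), node.req_out, node.data_out)
```

Each accepted flit costs exactly one transition on one request wire, so the two output channels can be in flight independently.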

New Arbitration Primitive

[Figure: arbitration primitive interface. Input channels: Req0/Ack0/Data0 and Req1/Ack1/Data1. Output channel: Req_Out, Ack_In, Data_Out. Callouts: Handshaking Signals (Request / Acknowledge); Data Channels.]

Page 41: A Low‐Overhead Asynchronous Interconnection Network for …nowick/fin-NOCS2010-slides-extended.pdf · 2010-06-10 · clock power consumed: without clock gating –Idle components

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Page 42: A Low‐Overhead Asynchronous Interconnection Network for …nowick/fin-NOCS2010-slides-extended.pdf · 2010-06-10 · clock power consumed: without clock gating –Idle components

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

Mutual ExclusionMutual ExclusionElement (Element (MutexMutex))

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Page 43: A Low‐Overhead Asynchronous Interconnection Network for …nowick/fin-NOCS2010-slides-extended.pdf · 2010-06-10 · clock power consumed: without clock gating –Idle components

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

Request ProtectionRequest ProtectionLatchesLatches(Normally Opaque)(Normally Opaque) ‏‏

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Page 44: A Low‐Overhead Asynchronous Interconnection Network for …nowick/fin-NOCS2010-slides-extended.pdf · 2010-06-10 · clock power consumed: without clock gating –Idle components

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

Data + Request Latch RegisterData + Request Latch Register (only one bank of latches required)(only one bank of latches required)

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Page 45: A Low‐Overhead Asynchronous Interconnection Network for …nowick/fin-NOCS2010-slides-extended.pdf · 2010-06-10 · clock power consumed: without clock gating –Idle components

Mutex

Ack1

Ack0

L4

L3

L1

L2

0

1

L5

L6

L7

Req0

Req1

Req_Out

Ack_In

Data0Data1

Data_Out

Mux_Select

LATCH

B8*C/D*+(%*8/E+'(B8*C/D*+(%*8/E+'(

4)()A)(@4)()A)(@

Acknowledgment Protection LatchesAcknowledgment Protection Latches(normally transparent)(normally transparent)

<)(=@/D*+(%*883%<)(=@/D*+(%*883% $%&'(%)('*+$%&'(%)('*+0%'1'('230%'1'('23

Arbitration Primitive: Latency
[Figure: arbitration primitive circuit diagram (Flow Control Unit, Datapath, Latch Controller; Mutex, latches L1–L7, data LATCH), with the latency path marked.]
New data arrives, followed by Request. One latch is initially opaque (its number is illegible in the source).


Arbitration Primitive: Throughput (best case, alternating inputs)
[Figure: arbitration primitive circuit diagram (Flow Control Unit, Datapath, Latch Controller; Mutex, latches L1–L7, data LATCH), with the throughput cycle marked. The annotated value is illegible in the source.]
New data arrives, followed by Request. One latch is initially opaque (its number is illegible in the source).

Wormhole Routing Capability

• Goal: support transmission of multi-flit packets
  – Example: XMT "store" packets = 2 flits (address + data)

• Solution: add 1 extra "glue bit" to each flit
  – Glue bit = 1: not the last flit in the packet
  – Enhanced arbitration primitive: bias the mutex decision
    • "Winner-take-all" strategy [Dally/Towles]
    • Header flit takes over the mutex: glue = 1
    • Last flit releases the mutex: glue = 0
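The glue-bit rule can be sketched behaviorally as follows. This is an illustrative software model, not the authors' circuit: the function name `arbitrate`, the representation of a flit as a `(payload, glue)` pair, and the alternate-when-free tie-break are all assumptions made for the example (a real mutex resolves contention by arrival time).

```python
from collections import deque


def arbitrate(port0, port1):
    """Merge two flit streams so that multi-flit packets stay contiguous.

    Each flit is a (payload, glue) pair; glue == 1 means "not the last
    flit of its packet". Hypothetical model for illustration only.
    """
    queues = [deque(port0), deque(port1)]
    owner = None       # port currently holding the mutex, if any
    last_winner = 1    # so that port 0 is preferred on the first contention
    output = []
    while queues[0] or queues[1]:
        if owner is None:
            # Mutex is free: prefer the port that did not win last time
            # (assumed tie-break), else whichever port has a flit waiting.
            preferred = 1 - last_winner
            owner = preferred if queues[preferred] else 1 - preferred
        if not queues[owner]:
            break  # owner's packet is incomplete; a real arbiter would wait
        payload, glue = queues[owner].popleft()
        output.append((owner, payload))
        last_winner = owner
        if glue == 0:
            owner = None  # last flit of the packet releases the mutex
    return output


store0 = [("addr0", 1), ("data0", 0)]  # 2-flit "store" packet on port 0
store1 = [("addr1", 1), ("data1", 0)]  # 2-flit "store" packet on port 1
result = arbitrate(store0, store1)
print(result)
```

Because the header flit carries glue = 1, the winning port keeps the mutex until its data flit (glue = 0) has passed, so the two store packets are never interleaved.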

Arbitration Primitive with Wormhole Control
[Figure: arbitration primitive circuit diagram with an Enhanced Flow Control Unit, Datapath, and Latch Controller. Components: Mutex, latches L1–L9, data LATCH. Signals: Req0, Req1, Ack0, Ack1, Req_Out, Ack_In, Data0, Data1, Data_Out, Mux_Select, Glue0, Glue1.]
Highlighted: Wormhole Control, driven by the glue0 and glue1 bits

Linear Pipeline Primitive
[Figure: one pipeline stage; left interface Req, Ack, Data; right interface Req_Out, Ack_In, Data_Out.]
– Can be inserted for buffering, to improve system-level throughput
– Basis for the design of the new fan-in/fan-out primitives

[8] M. Singh and S. M. Nowick, "MOUSETRAP: High-Speed Transition-Signaling Asynchronous Pipelines," IEEE Transactions on VLSI Systems, vol. 15, no. 11, pp. 1256-1269, Nov. 2007.

Linear Pipeline Primitive
[Figure: the same stage, highlighting the handshaking signals (request and acknowledgment).]

Linear Pipeline Primitive
[Figure: the same stage, highlighting the data channels.]

Mixed-Timing Interfaces

• Use Existing Synchronizing FIFOs [9] (with small modifications)
  – Supports "heterochronous" timing domains
  – No modification to existing components

• Modular Design
  – Reusable Put and Get components (either Async or Sync)
  – Each FIFO is an array of identical cells

• Supports Low-Power Operation
  – Circular FIFO: data does not move
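The "circular FIFO: data does not move" idea can be illustrated with a small behavioral sketch. It is only a single-threaded model under stated assumptions: the class name `CircularFifo` and its methods are hypothetical, and the sketch ignores the cross-domain synchronization that the FIFOs of [9] actually provide.

```python
class CircularFifo:
    """Array of identical cells; items stay in place, only indices advance."""

    def __init__(self, num_cells):
        self.cells = [None] * num_cells
        self.put_idx = 0   # next cell the Put side writes
        self.get_idx = 0   # next cell the Get side reads
        self.count = 0     # number of occupied cells

    @property
    def full(self):
        return self.count == len(self.cells)

    @property
    def empty(self):
        return self.count == 0

    def put(self, item):
        """Write into the current put cell; refuse when full."""
        if self.full:
            return False
        self.cells[self.put_idx] = item
        self.put_idx = (self.put_idx + 1) % len(self.cells)
        self.count += 1
        return True

    def get(self):
        """Read from the current get cell; return None when empty."""
        if self.empty:
            return None
        item = self.cells[self.get_idx]
        self.get_idx = (self.get_idx + 1) % len(self.cells)
        self.count -= 1
        return item


fifo = CircularFifo(3)
accepted = [fifo.put(x) for x in ("a", "b", "c", "d")]
drained = [fifo.get() for _ in range(4)]
print(accepted, drained)
```

Each item is written once and read once from the same cell, which is the property the slide credits with low-power operation; a shift-register FIFO would instead copy every item through every cell.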

[9] T. Chelcea and S. Nowick, "Robust Interfaces for Mixed-Timing Systems", IEEE Transactions on Very Large Scale Integration Systems, August 2004.

Outline

• Introduction
• Target GALS Network Design
• Background: XMT Processor / MoT Network
• Asynchronous Network Primitives
• Experimental Results
• Conclusions

Evaluation Methodology

• Direct Comparison with Synchronous MoT Network
  – Identical technology: IBM 90nm CMOS process
  – Identical functionality: same routing and arbitration primitives
  – Identical topology: 8-terminal networks with the same floorplan

• Evaluate at Multiple Levels of Integration
  – Isolated asynchronous primitives (post-layout)
  – 8-terminal asynchronous network (pre-layout with wire estimates; interconnection of laid-out router primitives)
  – 8-terminal GALS network
  – XMT architecture co-simulation on parallel kernels

Tool Flow

• Implemented in IBM 90nm technology
  – Placed and routed with Cadence SOC Encounter
  – Simulated as gate-level Verilog with extracted delays

• Standard Cell Methodology
  – ARM 90nm standard cells (IBM CMOS9SF)

• Exception: Mutual Exclusion Element
  – Designed using transistor models from the IBM 90nm PDK
  – Simulated in Cadence Spectre
  – Measured delays used to calibrate the Verilog behavioral model

Routing Primitive Comparison: Area and Power

Component Type   Area (μm²)   Energy/Packet (pJ)   Leakage Power (μW)   Idle Power (μW)
Synchronous      988.6        2.06                 1.82                 225.6
Asynchronous     358.4        0.37                 0.56                 0.6

• Area:
  – 64% less area: result of lightweight data storage
    • 2 flip-flop registers + extra MUX/DEMUX (sync) vs. 2 latch registers (async)
    • MUX/DEMUX overhead (sync)

• Energy/Packet (1 flit):
  – 82% less energy per packet
  – Steady-state measurement on random traffic

Routing Primitive Comparison: Latency and Throughput

                                Maximum Throughput (GFPS)
Component Type   Latency (ps)   Single   Random   Alternating
Synchronous      516            1.93     1.93     1.93
Asynchronous     546            1.07     1.34     1.70

• Synchronous: using max clock rate (1.93 GHz)

• Latency:
  – 546 ps (async) vs. 516 ps (sync)

• Max Throughput (Giga-flits/sec):
  – Single-ported traffic: 55% of sync max. (no concurrency)
  – Random traffic: 70% of sync max.
  – Alternating traffic: 88% of sync max. (most concurrency)

… expect significant future improvements by inserting a small number of FIFO stages

Arbitration Primitive Comparison: Area and Power

Component Type   Area (μm²)   Energy/Packet (pJ)   Leakage Power (μW)   Idle Power (μW)
Synchronous      2240.3       3.53                 4.13                 388.6
Asynchronous     349.3        0.33                 0.50                 0.5

• Area:
  – 84% less area
  – Due to low-overhead data storage
    • 4 flip-flop registers (sync) vs. 1 latch register (async)

• Energy/Packet (1 flit):
  – 91% less energy per packet
  – Measured in steady state, with packets arriving at both input ports
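The quoted savings for both primitives follow directly from the area and energy columns of the two comparison tables. The short check below recomputes them; the helper name `percent_reduction` and the dictionary layout are just conveniences for this sketch.

```python
def percent_reduction(sync_value, async_value):
    """Percentage by which the async value is smaller than the sync value."""
    return 100.0 * (1.0 - async_value / sync_value)


# (synchronous, asynchronous) pairs taken from the two comparison tables.
tables = {
    "routing": {"area_um2": (988.6, 358.4), "energy_pj": (2.06, 0.37)},
    "arbitration": {"area_um2": (2240.3, 349.3), "energy_pj": (3.53, 0.33)},
}

savings = {
    primitive: {
        metric: round(percent_reduction(sync_v, async_v))
        for metric, (sync_v, async_v) in metrics.items()
    }
    for primitive, metrics in tables.items()
}
print(savings)
```

The rounded results reproduce the 64% / 82% (routing) and 84% / 91% (arbitration) figures stated on the slides.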

Arbitration Primitive Comparison: Latency and Throughput

                                Max. Throughput (GFPS)
Component Type   Latency (ps)   Single   Both Ports
Synchronous      474            2.09     2.09
Asynchronous     489            1.08     2.04

• Synchronous: using max clock rate (2.09 GHz)

• Latency:
  – 489 ps (async) vs. 474 ps (sync)

• Max. Throughput (Giga-flits/sec):
  – Single port only: 51% of synchronous max.
  – Traffic at both ports: 98% of synchronous max.

… expect significant future improvements by inserting a small number of FIFO stages
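The "fraction of synchronous maximum" figures are ratios of the throughput columns in the two latency/throughput tables. The sketch below recomputes them from the table entries; because the tables are rounded to two decimals, the recomputed ratios may differ from the slide's quoted percentages by about a point. All names here are illustrative.

```python
def fraction_of_sync(async_gfps, sync_gfps):
    """Asynchronous throughput as a fraction of the synchronous maximum."""
    return async_gfps / sync_gfps


ROUTING_SYNC_GFPS = 1.93      # sync routing primitive, all traffic patterns
ARBITRATION_SYNC_GFPS = 2.09  # sync arbitration primitive, all traffic patterns

routing_async = {"single": 1.07, "random": 1.34, "alternating": 1.70}
arbitration_async = {"single": 1.08, "both_ports": 2.04}

routing_ratio = {
    k: fraction_of_sync(v, ROUTING_SYNC_GFPS) for k, v in routing_async.items()
}
arbitration_ratio = {
    k: fraction_of_sync(v, ARBITRATION_SYNC_GFPS)
    for k, v in arbitration_async.items()
}

for name, ratios in (("routing", routing_ratio), ("arbitration", arbitration_ratio)):
    for pattern, ratio in ratios.items():
        print(f"{name:12s} {pattern:12s} {100 * ratio:5.1f}% of sync max")
```

The pattern is the same for both primitives: throughput approaches the synchronous maximum only when traffic keeps both ports busy.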

8-Terminal Network Evaluation

• Head-on-Head Comparison with Sync Network

• Projected Network Layout
  – Pre-layout async network
  – Uses post-layout primitives, treated as hard IP macros, with assigned wire delays
  – Wire delays extrapolated from the ASIC floorplan of the Sync MoT

• Experimental Setup
  – Evaluate performance under uniformly random input traffic
  – 32-bit flits

Projected 8-Terminal Network Layout

• Based on Floorplan of Synchronous MoT Test ASIC
  – Designed/fabricated at UMD in March 2007 [10]

• Network divided into 4 partitions (P0, P1, P2, P3)
  – Fan-in trees exist entirely within one partition
  – Fan-out trees distributed among partitions

• Asynchronous Projection Methodology
  – Treat asynchronous primitives as hard IP macros
    • all routing and arbitration primitives have the same timing
  – Evenly distribute groups of primitives
  – Assign inter-primitive wire delays based on position
    • delays on wires assigned based on technology specifications

[10] A. O. Balkan, M. N. Horak, G. Qu, U. Vishkin, "Layout-accurate design and implementation of a high-throughput interconnection network for single-chip parallel processing", Hot Interconnects, August 2007.

Projected 8-Terminal Network Layout
[Figure: floorplan with four partitions (P0, P1, P2, P3) and an example fan-out tree spanning them.]

Current CAD Tool Flows: Sync vs. Async

• Synchronous Synthesis:
  – Automatic place/route optimizations
  – Includes cell resizing/repeater insertion

• Asynchronous Synthesis:
  – Limited optimization: hard macros + regular manual placement
  – No cell resizing/repeater insertion
  … much potential for future performance improvement

• Current Tools Do Not Define the Necessary Timing Constraints
  – No automatic path-length matching
  – Path-length matching is necessary to enforce the bundling constraint

Async Network Performance Comparison: 400 MHz Sync vs. Async
[Figure: throughput and latency plots, with the annotations below.]
– Comparable throughput for the entire range of Sync
– Sync has at least 4.3x higher latency for all Sync input rates
– Sync max. input rate: 102.4 Gbps
Note: sync max. input rate is limited by clock frequency

Async Network Performance Comparison: 800 MHz Sync vs. Async
[Figure: throughput and latency plots, with the annotations below.]
– Comparable throughput for the entire range of Sync
– Sync has >1.7x higher latency for input rates up to 73% of Sync max. (150 Gbps)
– Sync max. input rate: 204.8 Gbps
Note: sync max. input rate is limited by clock frequency

Async Network Performance Comparison: 1.36 GHz Sync vs. Async
[Figure: throughput and latency plots, with the annotations below.]
– Comparable throughput for rates up to 55% of Sync max. (190 Gbps)
– Lower latency for input rates up to 43% of Sync max. (150 Gbps)
– Sync max. input rate: 348.2 Gbps
Note: sync max. input rate is limited by clock frequency
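The three "Sync max. input rate" values are consistent with a simple product of terminal count, flit width, and clock frequency. The sketch below assumes that each of the 8 terminals can inject at most one 32-bit flit per clock cycle; that model is an inference from the numbers on the slides, not something the slides state explicitly, and the function name is illustrative.

```python
NUM_TERMINALS = 8   # 8-terminal network
FLIT_BITS = 32      # 32-bit flits


def max_input_rate_gbps(clock_mhz):
    """Aggregate injection limit, assuming one flit per terminal per cycle."""
    return NUM_TERMINALS * FLIT_BITS * clock_mhz / 1000.0


rates = {mhz: round(max_input_rate_gbps(mhz), 1) for mhz in (400, 800, 1360)}
print(rates)

# Annotated breakpoints expressed as a fraction of the sync maximum.
print(round(100 * 150 / rates[800]))    # 150 Gbps at 800 MHz
print(round(100 * 190 / rates[1360]))   # 190 Gbps at 1.36 GHz
print(round(100 * 150 / rates[1360]))   # 150 Gbps at 1.36 GHz
```

Under this assumption the computed limits match the 102.4, 204.8, and 348.2 Gbps annotations, which is why the synchronous curves stop at a clock-dependent input rate while the asynchronous network has no such clock-imposed ceiling.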

GALS Network Performance Comparison

• Experimental Setup
  – Create terminals to generate traffic and record measurements
  – Terminals generate uniformly random input traffic

• Results Normalized to Clock Rate
  – Throughput units (normalized): flits per cycle per port
  – Latency units (normalized): number of clock cycles
  – Sync network results: always the same relative to clock cycles
  – Async network results: vary with clock rate
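The normalization can be made concrete with a small conversion sketch. The function names and the 5 ns latency are hypothetical values chosen for illustration; only the 8 ports, the 32-bit flit width, and the 102.4 Gbps figure come from the slides.

```python
def latency_in_cycles(latency_ns, clock_mhz):
    """Convert an absolute latency to a number of terminal clock cycles."""
    return latency_ns * clock_mhz / 1000.0


def flits_per_cycle_per_port(throughput_gbps, clock_mhz, num_ports=8, flit_bits=32):
    """Convert aggregate throughput to normalized flits/cycle/port."""
    flits_per_second = throughput_gbps * 1e9 / flit_bits
    cycles_per_second = clock_mhz * 1e6
    return flits_per_second / cycles_per_second / num_ports


# A fixed (hypothetical) 5 ns asynchronous latency costs more cycles
# as the terminal clock gets faster.
for mhz in (400, 600, 800):
    print(mhz, "MHz:", latency_in_cycles(5.0, mhz), "cycles")

# 102.4 Gbps at 400 MHz corresponds to 1 flit per cycle per port.
print(flits_per_cycle_per_port(102.4, 400))
```

This is why the synchronous curves look the same at every clock rate in normalized units, while the asynchronous core, whose delays are fixed in absolute time, appears relatively slower as the clock rate rises.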

GALS Network Performance Comparison: 400 MHz GALS vs. Sync
[Figure: normalized throughput and latency plots, with the annotations below.]
– Comparable throughput for all traffic rates
– Sync has 52% higher latency up to 80% input traffic

GALS Network Performance Comparison: 600 MHz GALS vs. Sync
[Figure: normalized throughput and latency plots, with the annotations below.]
– Comparable throughput up to 65% input traffic
– Lower latency up to 60% input traffic

GALS Network Performance Comparison: 800 MHz GALS vs. Sync
[Figure: normalized throughput and latency plots, with the annotations below.]
– Comparable throughput up to 52% input traffic
– Lower latency up to 29% input traffic, comparable latency up to 40% input traffic

XMT Parallel Kernel Simulations

• Goal: Integrate with Synchronous XMT Parallel Architecture
  – XMT Verilog RTL description with GALS network

• XMT Parallel Kernels
  – Array Summation (add)
    • Compute the sum of 3 million elements in an array
  – Matrix Multiplication (mmul)
    • Compute the product of two 64x64 matrices
  – Breadth-First Search (bfs)
    • Run the XMT BFS algorithm with 100,000 vertices and 1 million edges
  – Array Increment (a_inc)
    • Increment all 32k elements of an array

XMT Parallel Kernel Simulations

• XMT Processor Configuration
  – 8 processing clusters (16 TCUs each) = 128 TCUs total
  – 8 distributed L1 D-cache modules (64KB total)

• Simulate GALS XMT at Different Clock Frequencies
  – 200, 400, 700 MHz

• Compare Speedups Relative to Synchronous XMT
  – Values greater than 1.0 indicate better performance
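A minimal sketch of the bookkeeping behind this setup follows. The speedup definition (synchronous runtime divided by GALS runtime at the same clock rate) is an assumption consistent with "values greater than 1.0 indicate better performance"; the runtimes are hypothetical placeholders, not measured results.

```python
CLUSTERS = 8
TCUS_PER_CLUSTER = 16
total_tcus = CLUSTERS * TCUS_PER_CLUSTER


def relative_speedup(sync_runtime_cycles, gals_runtime_cycles):
    """Speedup of GALS XMT relative to synchronous XMT (assumed definition)."""
    return sync_runtime_cycles / gals_runtime_cycles


# Hypothetical kernel runtimes, for illustration only.
example = relative_speedup(sync_runtime_cycles=1000, gals_runtime_cycles=1250)
print(total_tcus, example)
```

With these placeholder numbers the GALS system would be reported at 0.8, that is, below the 1.0 break-even line.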

GALS XMT Performance Comparison
[Figure: speedup of GALS XMT relative to synchronous XMT per kernel, arranged in order of increasing network utilization.]
– GALS XMT has similar performance at 200 and 400 MHz
– Only moderate degradation at 700 MHz (a_inc: 37% decrease)

Conclusions

• New GALS Network for Chip Multiprocessors
  – Low-overhead network for "heterochronous" interfaces

• Design of Two New Asynchronous Router Cells
  – Routing and arbitration circuits

• Overview of Results
  – Router Primitives
    • 64-84% less area, 82-91% less energy/packet
    • Latency & throughput (for balanced traffic) = ~2 Gflits/sec
  – System-Level Performance
    • Async network comparison with 800 MHz sync network:
      – Comparable throughput across all input traffic
      – 1.7x lower latency up to 73% max input traffic
    • GALS network comparison with 800 MHz sync network:
      – Comparable throughput up to 52% max input traffic
      – Lower latency up to 29% max input traffic

Future Directions

• Architectural Optimization
  – Insert linear pipeline stages on long wires to improve throughput

• Circuit Optimization
  – Improve designs of routing/arbitration primitives
  – Mixed-timing FIFO optimizations

• Asynchronous Topology Optimization
  – Area improvements using hybrid MoT-Butterfly [Balkan et al., DAC-08]

• Integrate with Synchronous Physical CAD Tool Flow
  – Goal = leverage existing commercial techniques
    • Timing constraint specification and synthesis of unclocked timing paths
    • Build on automated async flow of [Quinton/Greenstreet/Wilton TVLSI '08]
    • Optimized placement, routing, gate resizing and repeater insertion

• Target Alternative Parallel Architectures/Memory Systems

Backup Slides

Types of Mixed-Timing (GALS) Systems

• Pseudochronous
  – Same frequency, constant phase difference

• Mesochronous
  – Same frequency, undefined phase difference

• Plesiochronous
  – Nearly the same frequency, so the phase difference drifts slowly

• Heterochronous
  – Undefined frequency and phase difference

MOUSETRAP Asynchronous Pipelines

• Fast Communication
  – Transition signaling (2-phase) handshaking

• Synchronous-Style Channel Encoding
  – Single-rail bundled data protocol

• Low Latency
  – 1 transparent D-latch delay for an empty stage

• Minimal-Overhead Latch Controller
  – 1 XNOR gate
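The single-XNOR latch controller can be modeled behaviorally as below. This is a simplified software sketch of one stage, assuming the standard MOUSETRAP arrangement in which a stage's `done` output serves both as the request to the next stage and as the acknowledgment to the previous one; the class and method names are hypothetical, and timing (for example the bundling constraint) is not modeled.

```python
class MousetrapStage:
    """One pipeline stage: a data latch plus an XNOR latch controller."""

    def __init__(self):
        self.done = False  # toggles once per captured item (2-phase signaling)
        self.data = None

    def latch_enabled(self, ack_from_next):
        # XNOR: transparent when done == ack, i.e. the next stage has
        # already taken whatever this stage last captured.
        return self.done == ack_from_next

    def offer(self, req_in, data_in, ack_from_next):
        """Try to pass a new item into this stage; return True if captured."""
        if not self.latch_enabled(ack_from_next):
            return False  # latch is opaque: previous item is still held
        if req_in == self.done:
            return False  # no new transition on req, nothing to capture
        self.data = data_in
        self.done = req_in  # req transition passes through the open latch
        return True


stage = MousetrapStage()
ack = False                               # next stage has acknowledged nothing yet
first = stage.offer(True, "A", ack)       # captured; latch closes behind it
blocked = stage.offer(False, "B", ack)    # rejected: "A" not yet acknowledged
ack = True                                # next stage toggles ack for "A"
second = stage.offer(False, "B", ack)     # latch reopened; "B" captured
print(first, blocked, second, stage.data)
```

An empty stage starts transparent, which is the source of the one-latch-delay forward latency, and each item costs exactly one transition on `req` and one on `ack`.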

MOUSETRAP: A Basic FIFO (no computation)
[Figure: three pipeline stages (Stage N-1, Stage N, Stage N+1). Each stage has a data latch with enable En and a latch controller. Signals: reqN, ackN-1, reqN+1, ackN, doneN; Data in, Data out.]
Stages communicate using transition signaling.

MOUSETRAP: A Basic FIFO (no computation)
[Figure: the same three-stage pipeline, with a waveform marking one data item.]
Stages communicate using transition signaling: 1 transition per data item!

Basic Mixed-Clock FIFO (Sync-Sync)
[Figure: array of identical cells between a Put Controller with Full Detector (top) and a Get Controller with Empty Detector (bottom). Put-side signals: full, req_put, data_put, CLK_put. Get-side signals: CLK_get, data_get, req_get, valid_get, empty.]

• Sync-Sync FIFO: uses synchronous Put and Get modules
  – Sync-Sync is one of 4 mixed-timing FIFOs

• Mixed Async + Sync FIFOs: modular changes
  – Sync-Async: uses synchronous Put (top) and asynchronous Get
  – Async-Sync: uses synchronous Get (bottom) and asynchronous Put