High-Throughput Asynchronous Pipelines for Fine-Grain Dynamic Datapaths

Montek Singh and Steven Nowick

Columbia University, New York, USA

{montek,nowick}@cs.columbia.edu
http://www.cs.columbia.edu/~montek

Intl. Symp. Adv. Res. Asynchronous Circ. Syst. (ASYNC), April 2-6, 2000, Eilat, Israel.

Outline

Introduction

Background: Williams’ PS0 pipelines

New Pipeline Designs
  Dual-Rail: LP3/1, LP2/2 and LP2/1
  Single-Rail: LPSR2/1

Practical Issue: Handling slow environments

Results and Conclusions

Why Dynamic Logic?

Potentially:
  Higher speed
  Smaller area

“Latch-free” pipelines: the logic gate itself provides an implicit latch
  lower latency
  shorter cycle time
  smaller area (very important in gate-level pipelining!)

Our Focus: Dynamic logic pipelines

How Do We Achieve High Throughput?

Introduce novel pipeline protocols:
  specifically target dynamic logic
  reduce impact of handshaking delays
  shorter cycle times

Pipeline at very fine granularity:
  “gate-level”: each stage is a single gate deep
  highest throughputs possible
  latch-free datapaths especially desirable
  dynamic logic is a natural match

Prior Work: Asynchronous Pipelines

Sutherland (1989), Yun/Beerel/Arceo (1996)
  very elegant 2-phase control; expensive transition latches

Day/Woods (1995), Furber/Liu (1996)
  4-phase control; simpler latches, but complex controllers

Kol/Ginosar (1997)
  double latches; greater concurrency, but area-expensive

Molnar et al. (1997-99)
  Two designs: asp* and micropipeline; both very fast, but:
  – asp*: complex timing, cannot handle latch-free dynamic datapaths
  – micropipeline: area-expensive, cannot do logic processing at all!

Williams (1991), Martin (1997)
  dynamic stages; no explicit latches! low latency; throughput still limited

Background

PS0 Pipelines (Williams 1986-91)

Basic Architecture:

[Figure: one PS0 stage. Labels: Data in, Function Block, Completion Detector, Data out, PC]

PS0 Function Block

Each output is produced using a dynamic gate:

[Figure: dynamic gate. Labels: pull-down stack, “keeper”, evaluation control, precharge control, PC, data inputs, data outputs, to completion detector]

Dual-Rail Completion Detector

OR together the two rails of each bit
Combine the results using a C-element

[Figure: per-bit OR gates (bit0, bit1, ..., bitn) feeding a C-element that produces Done]
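The detector described above can be sketched behaviourally as follows. This is a minimal model, not the authors' circuit: the function names and the boolean encoding of the rails are assumptions made for illustration, and the C-element is modelled as a function that holds its previous output when its inputs disagree.

```python
from typing import List, Tuple


def c_element(inputs: List[bool], prev_output: bool) -> bool:
    """Muller C-element: rises when all inputs are 1, falls when all are 0, else holds."""
    if all(inputs):
        return True
    if not any(inputs):
        return False
    return prev_output


def completion_detector(bits: List[Tuple[bool, bool]], prev_done: bool) -> bool:
    """OR the two rails of each bit, then merge the per-bit results with a C-element."""
    bit_valid = [rail0 or rail1 for rail0, rail1 in bits]
    return c_element(bit_valid, prev_done)


# All rails low models a precharged stage; one rail high per bit models valid data.
precharged = [(False, False), (False, False)]
partial = [(True, False), (False, False)]
valid = [(True, False), (False, True)]

done = completion_detector(precharged, prev_done=False)       # stays low
done_partial = completion_detector(partial, prev_done=done)   # holds previous value
done_valid = completion_detector(valid, prev_done=done_partial)  # rises
```

Because the C-element holds its state, Done only rises once every bit is valid and only falls once every bit has been precharged.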

PS0 Protocol

PRECHARGE N: when N+1 completes evaluation
EVALUATE N: when N+1 completes precharging

[Figure: PS0 timing diagram for stages N, N+1, N+2 with events numbered 1 to 6. Surviving labels: “N evaluates”, “N+1 evaluates”, “N+2 evaluates”, “N+1 precharges”, “N+1 indicates ‘done’”]

Evaluate -> Precharge: 3 events
Precharge -> Evaluate: another 3 events
Complete cycle: 6 events

PS0 Performance

$T_{EVAL}$: Evaluation Time
$T_{PRECH}$: Precharge Time
$T_{DETECT}$: Completion Detection Time

[Figure: PS0 timing diagram with events numbered 1 to 6]

Cycle Time = $3\,T_{EVAL} + T_{PRECH} + 2\,T_{DETECT}$
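As a rough illustration of the PS0 cycle-time expression (three evaluations, one precharge, two completion detections), the sketch below evaluates it for a set of delays. The delay values are hypothetical and chosen only to make the arithmetic concrete; they are not measurements from the talk.

```python
def ps0_cycle_time(t_eval: float, t_prech: float, t_detect: float) -> float:
    """PS0 cycle time: 3 evaluations + 1 precharge + 2 completion detections."""
    return 3 * t_eval + t_prech + 2 * t_detect


# Hypothetical delays in nanoseconds (assumed values, for illustration only).
cycle_ns = ps0_cycle_time(t_eval=0.4, t_prech=0.4, t_detect=0.3)

# One item per cycle, so throughput in Mega items/sec is 1000 / (cycle time in ns).
throughput_mega_items = 1e3 / cycle_ns
```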

New Pipeline Designs

Overview of Approach

Our Goal: Shorter cycle time, without degrading latency

Our Approach: Use “Lookahead Protocols” (LP):
  main idea: anticipate critical events based on richer observation

Two new protocol optimizations:
  “Early evaluation”:
    give a stage a head start on evaluation by observing events further down the pipeline
    (actually, a similar idea was proposed by Williams in PA0, but our designs exploit it much better)
  “Early done”:
    a stage signals “done” when it is about to precharge/evaluate

Dual-Rail Design #1: LP3/1

Uses “early evaluation”:
  each stage now has two control inputs
  the new input comes from two stages ahead
  evaluate N as soon as N+1 starts precharging

[Figure: stages N, N+1, N+2 between Data in and Data out; each stage has PC and Eval control inputs, with Eval labelled “From N+2”]

LP3/1 Protocol

PRECHARGE N: when N+1 completes evaluation
EVALUATE N: when N+2 completes evaluation (New!)
  Enables “early evaluation!”

[Figure: LP3/1 timing diagram for stages N, N+1, N+2 with events numbered 1 to 4. Labels: “N evaluates”, “N+1 evaluates”, “N+2 evaluates”]

LP3/1: Comparison with PS0

[Figure: two timing diagrams for stages N, N+1, N+2. PS0: 6 events in cycle. LP3/1: only 4 events in cycle!]

LP3/1 Performance

[Figure: LP3/1 timing diagram with events numbered 1 to 4 and the “saved path” marked]

Cycle Time = $3\,T_{EVAL} + T_{DETECT}$

Savings over PS0: 1 Precharge + 1 Completion Detection
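The stated savings can be checked by subtracting the LP3/1 cycle time from the PS0 cycle time. The subscripted names $T_{PS0}$ and $T_{LP3/1}$ are shorthand introduced here for the two cycle times; the slides only write "Cycle Time".

```latex
\begin{align*}
T_{PS0}   &= 3\,T_{EVAL} + T_{PRECH} + 2\,T_{DETECT} \\
T_{LP3/1} &= 3\,T_{EVAL} + T_{DETECT} \\
T_{PS0} - T_{LP3/1} &= T_{PRECH} + T_{DETECT}
\end{align*}
```

The difference is exactly one precharge plus one completion detection.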

Inside a Stage: Merging Two Controls

A NAND gate combines the two control inputs:
  Precharge when PC=1 (and Eval=0)
  Evaluate “early” when Eval=1 (or PC=0)

[Figure: dynamic gate with pull-down stack and “keeper”; a NAND gate merges PC (From Stage N+1) and Eval (From Stage N+2)]

Problem: “early” Eval=1 is non-persistent!
  it may get de-asserted before the stage has completed evaluation!
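The two rules above (precharge when PC=1 and Eval=0, evaluate otherwise) can be tabulated with a small sketch. One assumption is made explicit: to obtain this behaviour from a NAND gate, the Eval input is treated as inverted at the gate; the slide text only says that a NAND combines the two controls. The function name is hypothetical.

```python
def stage_evaluates(pc: bool, eval_early: bool) -> bool:
    """Return True when the stage is in evaluate, False when it is precharging.

    Precharge requires PC=1 and Eval=0, so evaluate is NAND(PC, not Eval).
    Assumption: the Eval input is inverted at the NAND gate.
    """
    return not (pc and not eval_early)


# Truth table keyed by (PC, Eval); the value is True when the stage evaluates.
truth_table = {
    (pc, ev): stage_evaluates(bool(pc), bool(ev))
    for pc in (0, 1)
    for ev in (0, 1)
}
```

Only the (PC=1, Eval=0) entry selects precharge, which is why a de-asserted Eval is harmless once PC has fallen.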

LP3/1 Timing Constraints: Example

Problem: “early” Eval=1 is non-persistent!

Observation: PC=0 arrives soon after Eval=1, and is persistent
  use PC as a safe “takeover” for Eval!

Solution: no change!

Timing Constraint: PC=0 arrives before Eval=1 is de-asserted
  simple one-sided timing requirement
  other constraints as well... all easily satisfied in practice

[Figure: NAND gate merging PC (From Stage N+1) and Eval (From Stage N+2)]
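The takeover constraint can be illustrated with a sampled-waveform sketch. All event times below are hypothetical, the function names are invented for this example, and the control merge again assumes the Eval input is inverted at the NAND gate. The check asks whether a stage that starts evaluating when Eval rises ever drops back into precharge before the end of the observation window.

```python
def control_at(t: float, t_eval_high: float, t_pc_low: float, t_eval_low: float) -> bool:
    """Return True if the stage is in evaluate at time t."""
    pc = t < t_pc_low                       # PC=1 until it falls, then stays 0 (persistent)
    ev = t_eval_high <= t < t_eval_low      # Eval=1 only during a window (non-persistent)
    return not (pc and not ev)              # assumed merge: NAND(PC, not Eval)


def stays_in_evaluate(t_eval_high: float, t_pc_low: float, t_eval_low: float,
                      t_end: float, steps: int = 1000) -> bool:
    """Sample the control from Eval rising to t_end; True if it never selects precharge."""
    samples = [t_eval_high + (t_end - t_eval_high) * i / steps for i in range(steps + 1)]
    return all(control_at(t, t_eval_high, t_pc_low, t_eval_low) for t in samples)


# Constraint met: PC falls (1.5) before Eval is de-asserted (2.0).
safe = stays_in_evaluate(t_eval_high=1.0, t_pc_low=1.5, t_eval_low=2.0, t_end=3.0)

# Constraint violated: Eval is de-asserted (2.0) before PC falls (2.5).
unsafe = stays_in_evaluate(t_eval_high=1.0, t_pc_low=2.5, t_eval_low=2.0, t_end=3.0)
```

In the violating case the stage briefly sees PC=1 with Eval=0 and would start to precharge in the middle of its evaluation.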

Dual-Rail Design #2: LP2/2

Uses “early done”:
  completion detector now placed before the function block
  stage indicates “done” when about to precharge/evaluate

[Figure: stage with an “early” Completion Detector ahead of the Function Block, between Data in and Data out]

LP2/2 Completion Detector

Modified completion detectors needed:
  Done=1 when stage starts evaluating, and inputs valid
  Done=0 when stage starts precharging

[Figure: per-bit OR gates (bit0, bit1, ..., bitn) on the “+” inputs of an asymmetric C-element, together with PC, producing Done]
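A behavioural sketch of the modified detector is given below. It is an assumption-laden model rather than the actual circuit: the stage's control is passed in as a boolean `evaluating` flag to avoid committing to a signal polarity, and the function name is hypothetical. The asymmetry is that the data inputs take part only in the rising transition of Done.

```python
from typing import List, Tuple


def early_done(evaluating: bool, bits: List[Tuple[bool, bool]], prev_done: bool) -> bool:
    """Asymmetric C-element sketch for the "early done" detector.

    Rises only when the stage is evaluating AND every input bit has a rail asserted.
    Falls as soon as the stage starts precharging, without waiting for the data to reset.
    Otherwise holds its previous value.
    """
    if not evaluating:
        return False
    if all(rail0 or rail1 for rail0, rail1 in bits):
        return True
    return prev_done


valid = [(True, False), (False, True)]
partial = [(True, False), (False, False)]

rises = early_done(True, valid, prev_done=False)      # evaluating with valid inputs
waits = early_done(True, partial, prev_done=False)    # inputs not yet all valid
falls = early_done(False, valid, prev_done=True)      # precharge starts, data still valid
```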

LP2/2 Protocol

Completion detection occurs in parallel with evaluation/precharge:

[Figure: LP2/2 timing diagram for stages N, N+1, N+2 with events numbered 1 to 4. Label: N+1 “early done”]


LP2/2 Performance

[Figure: LP2/2 timing diagram with events numbered 1 to 4]

LP2/2 savings over PS0: 1 Evaluation + 1 Precharge
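The cycle-time expression for this design did not survive in the slide text, but it can be derived from the stated savings, taking the PS0 cycle time as $3\,T_{EVAL} + T_{PRECH} + 2\,T_{DETECT}$. The name $T_{LP2/2}$ is shorthand introduced here, and the result is a derivation from the slide's own figures rather than a quoted formula.

```latex
\begin{align*}
T_{LP2/2} &= T_{PS0} - (T_{EVAL} + T_{PRECH}) \\
          &= (3\,T_{EVAL} + T_{PRECH} + 2\,T_{DETECT}) - T_{EVAL} - T_{PRECH} \\
          &= 2\,T_{EVAL} + 2\,T_{DETECT}
\end{align*}
```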

Dual-Rail Design #3: LP2/1

Hybrid of LP3/1 and LP2/2. Combines:
  early evaluation of LP3/1
  early done of LP2/2

New Pipeline Designs

Single-Rail Design: LPSR2/1

Derivative of LP2/1, adapted to single-rail:
  bundled-data: matched delays instead of completion detectors

[Figure: three pipeline stages, each with a matched delay element]

“Ack” to previous stages is “tapped off early”:
  once in evaluate (precharge), dynamic logic is insensitive to input changes

Inside an LPSR2/1 Stage

PC and Eval are combined exactly as in LP3/1

“done” is generated by an asymmetric C-element:
  done=1 when stage evaluates, and data inputs valid
  done=0 when stage precharges

[Figure: LPSR2/1 stage. Labels: data in, data out, “req” in, “req” out, matched delay, NAND, aC (asymmetric C-element, “+” input), done, “ack”, PC (From Stage N+1), Eval (From Stage N+2)]

LPSR2/1 Protocol

[Figure: LPSR2/1 timing diagram for stages N, N+1, N+2 with events numbered 1 to 3. Label: “N+1 evaluates”]

Cycle Time = $2\,T_{EVAL} + T_{aC}$

$T_{aC}$: Delay through asymmetric C-element
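One way to use the single-rail cycle-time expression (two evaluations plus one asymmetric C-element delay) is as a delay budget: given a target cycle time and a C-element delay, how slow may a gate's evaluation be? The sketch below does that arithmetic. The numbers are hypothetical, not taken from the talk, and the helper names are invented for this example.

```python
def lpsr21_cycle_time(t_eval: float, t_ac: float) -> float:
    """LPSR2/1 cycle time: 2 evaluations + 1 asymmetric C-element delay."""
    return 2 * t_eval + t_ac


def max_t_eval_for_target(target_cycle: float, t_ac: float) -> float:
    """Largest evaluation delay that still meets the target cycle time."""
    return (target_cycle - t_ac) / 2


# Hypothetical budget in nanoseconds (assumed values, for illustration only).
t_eval_budget = max_t_eval_for_target(target_cycle=1.0, t_ac=0.2)
check_cycle = lpsr21_cycle_time(t_eval_budget, t_ac=0.2)
```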


Practical Issue: Handling Slow Environments

We inherit a timing assumption from Williams’ PS0:
  Input (left) environment must precharge reasonably fast

Problem: If environment is stuck in precharge, all pipelines (incl. PS0) will malfunction!

Our Solution: Add a special robust controller for the 1st stage
  simply synchronizes input environment and pipeline
  delays critical events until environment has finished precharge

Modular solution overcomes a shortcoming of Williams’ PS0

No serious throughput overhead
  real bottleneck is the slow environment!


Results

Designed/simulated FIFOs for each pipeline style

Experimental Setup:
  design: 4-bit wide, 10-stage FIFO
  technology: 0.6 µm HP CMOS
  operating conditions: 3.3 V and 300 K

Comparison with Williams’ PS0

Design     Rail          Throughput (Mega items/sec)   Improvement over PS0 (%)
PS0        dual-rail      420                          -
LP3/1      dual-rail      590                          40%
LP2/2      dual-rail      760                          79%
LP2/1      dual-rail      860                          102%
LPSR2/1    single-rail   1208                          188%

LP2/1: >2X faster than Williams’ PS0
LPSR2/1: 1.2 Giga items/sec
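To relate these throughputs to the cycle-time expressions used earlier in the talk, each figure can be converted to an average time per item. The sketch below uses the throughput values from the table above and assumes one item per cycle; the dictionary and function names are made up for this example.

```python
# Throughputs from the results table, in Mega items/sec.
throughput_mega_items = {
    "PS0": 420,
    "LP3/1": 590,
    "LP2/2": 760,
    "LP2/1": 860,
    "LPSR2/1": 1208,
}


def cycle_time_ns(mega_items_per_sec: float) -> float:
    """Average time per item in nanoseconds, assuming one item per cycle."""
    return 1e3 / mega_items_per_sec


cycle_ns = {design: cycle_time_ns(t) for design, t in throughput_mega_items.items()}
```

Under that assumption, only the single-rail design reaches a cycle time below one nanosecond.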

Comparison: LPSR2/1 vs. Molnar FIFOs

LPSR2/1 FIFO: 1.2 Giga items/sec
  Adding logic processing to FIFO:
    simply fold logic into dynamic gate; little overhead

Comparison with Molnar FIFOs:
  asp* FIFO: 1.1 Giga items/sec
    more complex timing assumptions, not easily formalized
    requires explicit latches, separate from logic!
    adding logic processing between stages: significant overhead
  micropipeline: 1.7 Giga items/sec
    two parallel FIFOs, each only 0.85 Giga/sec
    very expensive transition latches
    cannot add logic processing to FIFO!

Practicality of Gate-Level Pipelining

When datapath is wide:
  Can often split into narrow “streams”
  Use “localized” completion detector for each stream:
    need to examine only a few bits: small fan-in
    send “done” to only a few gates: small fan-out
  comp. det. fairly low cost!

[Figure: example with datapath width = 32 dual-rail bits; comp. det. fan-in = 2; done fan-out = 2]
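The stream-splitting idea can be sketched as follows. The stream width of 2 is taken from the "fan-in = 2" label in the figure, and the 32-bit width from its caption; everything else (function names, the use of a static AND-of-ORs check instead of a C-element) is a simplifying assumption for illustration.

```python
from typing import List, Tuple

Bit = Tuple[bool, bool]  # (rail0, rail1) of one dual-rail bit


def split_into_streams(bits: List[Bit], stream_width: int) -> List[List[Bit]]:
    """Partition a wide datapath into narrow streams of stream_width bits each."""
    return [bits[i:i + stream_width] for i in range(0, len(bits), stream_width)]


def stream_valid(stream: List[Bit]) -> bool:
    """Localized detector: every bit in this stream has one rail asserted."""
    return all(rail0 or rail1 for rail0, rail1 in stream)


# 32 dual-rail bits, all carrying valid data.
datapath: List[Bit] = [(True, False)] * 32
streams = split_into_streams(datapath, stream_width=2)
local_done = [stream_valid(s) for s in streams]

# A bit that has not yet evaluated only holds back its own stream.
datapath_late = list(datapath)
datapath_late[5] = (False, False)
local_done_late = [stream_valid(s) for s in split_into_streams(datapath_late, 2)]
```

Each detector examines only its own two bits, so its fan-in stays constant no matter how wide the whole datapath is.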

Conclusions

Introduced several new dynamic pipelines:
  Use two novel protocols:
    – “early evaluation”
    – “early done”
  Especially suitable for fine-grain (gate-level) pipelining
  Very high throughputs obtained:
    – dual-rail: >2X improvement over Williams’ PS0
    – single-rail: 1.2 Giga items/second in 0.6 µm CMOS
  Use easy-to-satisfy, one-sided timing constraints
  Robustly handle arbitrary-speed environments
    – overcome a major shortcoming of Williams’ PS0 pipelines

Recent Improvement: Even faster single-rail pipeline (WVLSI’00)
