Asynchronous Pipelines for Fine-Grain Dynamic Datapaths
Montek Singh and Steven Nowick
Columbia University, New York, USA
{montek,nowick}@cs.columbia.edu
http://www.cs.columbia.edu/~montek
Intl. Symp. Adv. Res. Asynchronous Circ. Syst. (ASYNC), April 2-6, 2000, Eilat, Israel.
High-Throughput Asynchronous Pipelines for Fine-Grain Dynamic Datapaths
"Latch-free" pipelines: the logic gate itself provides an implicit latch
– lower latency
– shorter cycle time
– smaller area (very important in gate-level pipelining!)
Pipeline at very fine granularity: "gate-level": each stage is a single gate deep
– highest throughputs possible
– latch-free datapaths especially desirable
– dynamic logic is a natural match
very elegant 2-phase control; expensive transition latches
Day/Woods (1995), Furber/Liu (1996): 4-phase control; simpler latches, but complex controllers
Kol/Ginosar (1997): double latches; greater concurrency, but area-expensive
Molnar et al. (1997-99): two designs, asp* and micropipeline; both very fast, but:
– micropipeline: area-expensive, cannot do logic processing at all!
Williams (1991), Martin (1997): dynamic stages; no explicit latches! Low latency; throughput still limited
Our Goal: Shorter cycle time, without degrading latency
Our Approach: Use "Lookahead Protocols" (LP)
– main idea: anticipate critical events based on richer observation
Two new protocol optimizations:
"Early evaluation": give a stage a head start on evaluation by observing events further down the pipeline (a similar idea was proposed by Williams in PA0, but our designs exploit it much better)
"Early done": stage signals "done" when it is about to precharge/evaluate
Uses "early evaluation": each stage now has two control inputs
– the new input comes from two stages ahead
– evaluate N as soon as N+1 starts precharging
Observation: PC=0 arrives soon after Eval=1, and is persistent; use PC as a safe "takeover" for Eval!
Solution: no change!
Timing Constraint: PC=0 arrives before Eval=1 is de-asserted
– simple one-sided timing requirement
– other constraints as well, all easily satisfied in practice
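The "takeover" idea can be sketched as a tiny Boolean model. This is only an illustration under stated assumptions, not the authors' circuit: it assumes a stage evaluates whenever Eval=1 or PC=0 and precharges otherwise, and the function names, trace values and event times are hypothetical.

```python
def stage_evaluates(pc: int, eval_: int) -> bool:
    """Assumed control rule: evaluate if Eval=1 or PC=0, otherwise precharge."""
    return eval_ == 1 or pc == 0


def takeover_is_safe(t_pc_low: float, t_eval_deasserted: float) -> bool:
    """One-sided timing check: PC=0 must arrive before Eval=1 is de-asserted."""
    return t_pc_low < t_eval_deasserted


# One hand-off, written as (PC, Eval) pairs; the values are illustrative only.
trace = [
    (1, 0),  # PC=1, Eval=0: stage precharges
    (1, 1),  # Eval=1 arrives first: early evaluation starts
    (0, 1),  # PC=0 arrives while Eval is still 1
    (0, 0),  # Eval is de-asserted; the persistent PC=0 keeps the stage evaluating
]
phases = ["evaluate" if stage_evaluates(pc, ev) else "precharge" for pc, ev in trace]
print(phases)
```

If PC=0 arrived after Eval was de-asserted, the trace would contain a (1, 0) step in the middle and the stage would briefly fall back into precharge, which is what the timing constraint rules out.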
Modified completion detectors needed:
– Done=1 when stage starts evaluating, and inputs valid
– Done=0 when stage starts precharging
– asymmetric C-element
[Figure: completion detector. Labels: C (C-element) with output Done; OR gates on bit 0, bit 1, ..., bit n; PC.]
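The Done behavior described for the modified completion detector can be sketched as a small behavioral model. It is a simplification under assumptions: the class and function names are hypothetical, the stage's phase is passed in as a plain boolean rather than derived from PC, and each dual-rail bit is assumed valid when either of its rails is high (the per-bit OR).

```python
class AsymmetricCElement:
    """Behavioral model: setting needs both conditions, resetting needs only one."""

    def __init__(self) -> None:
        self.done = 0

    def update(self, evaluating: bool, inputs_valid: bool) -> int:
        if evaluating and inputs_valid:
            self.done = 1  # stage is evaluating and its data inputs are valid
        elif not evaluating:
            self.done = 0  # stage starts precharging: reset regardless of inputs
        # Otherwise (evaluating, inputs not valid): hold the previous value.
        return self.done


def all_bits_valid(dual_rail_bits) -> bool:
    """Each bit is a (true_rail, false_rail) pair; a bit is valid if one rail is high."""
    return all(bool(t or f) for t, f in dual_rail_bits)


empty = [(0, 0), (0, 0)]  # both bits still precharged (no data)
data = [(1, 0), (0, 1)]   # both bits carry a value

c = AsymmetricCElement()
sequence = [
    c.update(True, all_bits_valid(empty)),   # evaluating, inputs not yet valid
    c.update(True, all_bits_valid(data)),    # inputs become valid
    c.update(True, all_bits_valid(empty)),   # inputs withdrawn while evaluating
    c.update(False, all_bits_valid(empty)),  # stage starts precharging
]
print(sequence)
```

The asymmetry shows in the third step: once Done is set, losing valid inputs does not clear it; only the start of precharge does.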
LP2/2 Protocol
Completion detection occurs in parallel with evaluation/precharge:
[Figure: LP2/2 protocol across stages N, N+1, N+2, with numbered events: N evaluates; N+1 evaluates; N+1 "early done"; N+2 "early done".]
LP2/2 Performance
Cycle Time = $2\,T_{EVAL} + 2\,T_{DETECT}$
LP2/2 savings over PS0: 1 Evaluation + 1 Precharge
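As a worked illustration of this saving, the sketch below takes the LP2/2 cycle time as two evaluations plus two completion detections, and builds the PS0 cycle time by adding back the one evaluation and one precharge that LP2/2 saves. The function names and the delay values are hypothetical, chosen only to make the arithmetic easy to follow; they are not measurements from the talk.

```python
def cycle_time_lp22(t_eval: float, t_detect: float) -> float:
    """LP2/2 cycle time: 2 evaluations + 2 completion detections."""
    return 2 * t_eval + 2 * t_detect


def cycle_time_ps0(t_eval: float, t_detect: float, t_prech: float) -> float:
    """PS0 cycle time, taken as LP2/2 plus the 1 evaluation + 1 precharge it saves."""
    return cycle_time_lp22(t_eval, t_detect) + t_eval + t_prech


# Hypothetical stage delays in nanoseconds (illustrative only).
T_EVAL, T_DETECT, T_PRECH = 0.25, 0.5, 0.25

lp22 = cycle_time_lp22(T_EVAL, T_DETECT)
ps0 = cycle_time_ps0(T_EVAL, T_DETECT, T_PRECH)
print(f"LP2/2: {lp22} ns, PS0: {ps0} ns, saved: {ps0 - lp22} ns")
```

With these made-up delays the saving equals T_EVAL + T_PRECH; how large that is relative to the whole cycle depends on how expensive completion detection is.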
Hybrid of LP3/1 and LP2/2. Combines:
– early evaluation of LP3/1
– early done of LP2/2
Derivative of LP2/1, adapted to single-rail:
– bundled-data: matched delays instead of completion detectors
[Figure: pipeline stages with matched delay elements.]
– "Ack" to previous stages is "tapped off early"
– once in evaluate (precharge), dynamic logic is insensitive to input changes
Inside an LPSR2/1 Stage
PC and Eval are combined exactly as in LP3/1
"done" generated by an asymmetric C-element
– done=1 when stage evaluates, and data inputs valid
– done=0 when stage precharges
We inherit a timing assumption from Williams' PS0: the input (left) environment must precharge reasonably fast
Problem: if the environment is stuck in precharge, all pipelines (incl. PS0) will malfunction!
Our Solution: add a special robust controller for the 1st stage
– simply synchronizes input environment and pipeline
– delays critical events until environment has finished precharge
Modular solution overcomes a shortcoming of Williams' PS0
No serious throughput overhead: the real bottleneck is the slow environment!
Designed/simulated FIFOs for each pipeline style
Experimental Setup:
– design: 4-bit wide, 10-stage FIFO
– technology: 0.6 µm HP CMOS
– operating conditions: 3.3 V and 300 K
Throughput

Design     Mega items/sec   Improvement over PS0 (%)
PS0        420              -
LP3/1      590              40
LP2/2      760              79
LP2/1      860              102
LPSR2/1    1208             188

PS0, LP3/1, LP2/2 and LP2/1 are dual-rail; LPSR2/1 is single-rail.
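To relate these throughputs to per-item cycle times, the following sketch converts each figure using cycle time (ns) = 1000 / (Mega items/sec). The throughput values are copied from the table; the dictionary and function names are hypothetical, and the resulting cycle times are derived arithmetic, not numbers reported in the talk.

```python
# Throughputs from the table, in Mega items/sec.
throughput_mega_items_per_sec = {
    "PS0": 420,
    "LP3/1": 590,
    "LP2/2": 760,
    "LP2/1": 860,
    "LPSR2/1": 1208,
}


def cycle_time_ns(mega_items_per_sec: float) -> float:
    """One item every 1/throughput seconds; 1 / (1e6 items/s) = 1000 ns."""
    return 1000.0 / mega_items_per_sec


cycle_times = {
    design: cycle_time_ns(rate)
    for design, rate in throughput_mega_items_per_sec.items()
}

for design, ns in cycle_times.items():
    print(f"{design}: {ns:.2f} ns per item")
```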
Comparison with Williams' PS0
– LP2/1: >2X faster than Williams' PS0
– LPSR2/1: 1.2 Giga items/sec
Comparison: LPSR2/1 vs. Molnar FIFOs
LPSR2/1 FIFO: 1.2 Giga items/sec
Adding logic processing to FIFO: simply fold logic into dynamic gate; little overhead
Comparison with Molnar FIFOs:
asp* FIFO: 1.1 Giga items/sec
– more complex timing assumptions, not easily formalized
– requires explicit latches, separate from logic!
– adding logic processing between stages: significant overhead
micropipeline: 1.7 Giga items/sec
– two parallel FIFOs, each only 0.85 Giga items/sec
– very expensive transition latches
– cannot add logic processing to FIFO!
Especially suitable for fine-grain (gate-level) pipelining
Very high throughputs obtained:
– dual-rail: >2X improvement over Williams' PS0
– single-rail: 1.2 Giga items/second in 0.6 µm CMOS
Use easy-to-satisfy, one-sided timing constraints
Robustly handle arbitrary-speed environments
– overcome a major shortcoming of Williams' PS0 pipelines
Recent Improvement: even faster single-rail pipeline (WVLSI'00)