1
Clockless Computing
Montek Singh
Thu, Sep 13, 2007
2
Dynamic Logic Pipelines (contd.)
  Drawbacks of Williams’ PS0 Pipelines
  Lookahead Pipelines [Singh/Nowick 2000]
  High-Capacity Pipelines [Singh/Nowick 2000]
3
Drawbacks of PS0 Pipelining
1. Poor throughput:
  long cycle time: 6 events per cycle
  data “tokens” are forced far apart in time
2. Limited storage capacity:
  at most 50% of stages can hold distinct tokens
  data tokens must be separated by at least one spacer
My Research Goals have been:
  address both issues
  still maintain very low latency
4
Recent Approaches
3 novel styles for high-speed async pipelining:
  MOUSETRAP Pipelines [Singh/Nowick, TAU-00, ICCD-01]
  “Lookahead Pipelines” (LP) [Singh/Nowick, Async-00]
  “High-Capacity Pipelines” (HC) [Singh/Nowick, WVLSI-00]
Goal: significantly improve throughput of PS0
Two Distinct Strategies:
  LP: introduce protocol optimizations
    “shave off” components from critical cycle
  HC: fundamentally new protocol
    greater concurrency: “loosely-coupled” stages
5
Outline
New Asynchronous Pipelines:
  MOUSETRAP Pipelines (static circuit style)
  Lookahead Pipelines (LP) (dynamic circuit style)
  High-Capacity Pipelines (HC) (dynamic circuit style)
6
Lookahead Pipeline Styles
Singh and Nowick
Async-2000 [Best Paper Award]
7
Lookahead Pipelines: Strategy #1
Use non-neighbor communication:
  stage receives information from multiple later stages
  allows “early evaluation”
Benefit: stage gets a head-start on its next cycle
8
Lookahead Pipelines: Strategy #2
Use early completion detection:
  completion detector moved before stage (not after)
  stage indicates “early done” in parallel with computation
Benefit: again, stage gets a head-start on its next cycle
[Figure: stage with an early completion detector]
9
Lookahead Pipelines: Overview
5 New Designs:
“Dual-Rail” Data Signaling:
  LP3/1: “early evaluation”
  LP2/2: “early done”
  LP2/1: “early evaluation” + “early done”
“Single-Rail” Bundled-Data Signaling:
  LPSR2/2: “early done”
  LPSR2/1: “early evaluation” + “early done”
10
Dual-Rail Design #1: LP3/1
Optimization = “early evaluation”
  each stage has two control inputs: from stages N+1 and N+2
Idea: shorten precharge phase
  terminate precharge early: when N+2 is done evaluating
[Figure: stages N, N+1, N+2, each a Processing Block followed by a Completion Detector, with Data in and Data out; each stage has control inputs PC and Eval, the Eval input coming from stage N+2]
11
LP3/1 Protocol
PRECHARGE N: when N+1 completes evaluation
EVALUATE N: when N+2 completes evaluation
  New! Enables “early evaluation!”
[Figure: event sequence (events 1 to 4) across stages N, N+1, N+2: N evaluates; N+1 evaluates; N+1 indicates “done”; N+2 evaluates; N+2 indicates “done”]
12
LP3/1: Comparison with PS0
PS0: 6 events in cycle
  PRECHARGE N: when N+1 completes evaluation
  EVALUATE N: when N+1 completes precharging
LP3/1: only 4 events in cycle! Enables “early evaluation!”
  PRECHARGE N: when N+1 completes evaluation
  EVALUATE N: when N+2 completes evaluation
[Figure: side-by-side event sequences for PS0 (events 1 to 6) and LP3/1 (events 1 to 4) across stages N, N+1, N+2; each stage evaluates and then indicates “done”]
13
LP3/1 Performance
Cycle Time = $3\,T_{EVAL} + T_{DETECT}$
Savings over PS0: 1 Precharge + 1 Completion Detection
[Figure: LP3/1 cycle (events 1 to 4) with the saved path highlighted]
14
LP3/1: Inside a Stage
Merging 2 Control Inputs:
  PC (from stage N+1) and Eval (from stage N+2) are combined by a NAND gate
Timing Issues: must satisfy several simple constraints
  Ex.: PC must arrive before Eval is de-asserted
  1-sided timing requirement
  easily satisfied in practice
[Figure: NAND gate merging PC and Eval inside a stage; the timing diagram contrasts “old Eval” with “early Eval”]
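The merged control can be sketched in a few lines of Python. This is only an illustration: the signal polarities are assumptions (PC is taken to be high once N+1 has finished evaluating, Eval is taken to go low once N+2 has finished evaluating, and a low NAND output is taken to mean precharge), and the function names are hypothetical.

```python
def ps0_phase(done_n1: bool) -> str:
    """PS0 rule: stage N precharges while N+1 is done evaluating,
    and evaluates again only after N+1 has precharged."""
    return "precharge" if done_n1 else "evaluate"


def lp31_phase(done_n1: bool, done_n2: bool) -> str:
    """LP3/1 rule with the two control inputs merged by a NAND gate.

    Assumed polarities (not stated on the slide):
      pc   = 1 once stage N+1 has completed evaluation
      eval = 0 once stage N+2 has completed evaluation
    """
    pc = done_n1
    eval_sig = not done_n2
    control = not (pc and eval_sig)  # NAND of the two control inputs
    return "evaluate" if control else "precharge"


if __name__ == "__main__":
    for done_n1, done_n2 in [(False, False), (True, False), (True, True)]:
        print(done_n1, done_n2, ps0_phase(done_n1), lp31_phase(done_n1, done_n2))
```

In the third case (N+1 and N+2 both done evaluating) PS0 is still precharging, while LP3/1 has already returned to evaluate. Under these assumed polarities, PC has to rise before Eval falls or the precharge pulse would be missed, which is one way to read the 1-sided timing requirement.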
15
Dual-Rail Design #2: LP2/2
Optimization = “early done”
Idea: move completion detector before processing block
  stage indicates when it is “about to” precharge/evaluate
[Figure: an “early” Completion Detector placed on Data in, ahead of the Processing Block, producing “early done”; Data out leaves the Processing Block]
16
LP2/2 Completion Detector
Modified completion detectors needed:
  Done=1 when stage starts evaluating, and inputs valid
  Done=0 when stage starts precharging
[Figure: per-bit OR gates (bit0, bit1, ..., bitn) and PC feed an asymmetric C-element (C, with “+” inputs) that produces Done]
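The set/reset behaviour listed above can be modelled behaviourally. The sketch below is an assumption-laden illustration, not the circuit itself: the class name, the `evaluating` flag (standing in for the PC control) and the encoding of a dual-rail bit as a `(rail0, rail1)` pair are all hypothetical.

```python
class EarlyCompletionDetector:
    """Behavioural model of an asymmetric C-element used as an early
    completion detector (assumed encoding: a bit is valid when either
    of its two rails is high, as computed by the per-bit OR gates)."""

    def __init__(self) -> None:
        self.done = False

    def update(self, evaluating: bool, input_bits) -> bool:
        all_valid = all(r0 or r1 for r0, r1 in input_bits)
        if not evaluating:
            # Reset depends on the control input alone (asymmetric fall).
            self.done = False
        elif all_valid:
            # Set needs both: stage evaluating AND every input bit valid.
            self.done = True
        # Otherwise the element holds its previous value.
        return self.done


if __name__ == "__main__":
    det = EarlyCompletionDetector()
    print(det.update(True, [(1, 0), (0, 0)]))   # one input still a spacer
    print(det.update(True, [(1, 0), (0, 1)]))   # all inputs valid
    print(det.update(False, [(1, 0), (0, 1)]))  # stage starts precharging
```

The asymmetry is the point: Done rises only when both conditions hold, but falls as soon as the stage starts precharging, without waiting for the inputs to return to spacers.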
17
LP2/2 Protocol
Completion Detection:
  performed in parallel with evaluation/precharge of stage
[Figure: event sequence (events 1 to 4) across stages N, N+1, N+2: N evaluates; N+1 evaluates; “early done” of N+1 eval; “early done” of N+2 eval; “early done” of N+1 prech]
18
LP2/2 Performance
Cycle Time = $2\,T_{EVAL} + 2\,T_{DETECT}$
LP2/2 savings over PS0: 1 Evaluation + 1 Precharge
[Figure: LP2/2 cycle (events 1 to 4)]
19
Dual-Rail Design #3: LP2/1
Hybrid of LP3/1 and LP2/2. Combines:
  early evaluation of LP3/1
  early done of LP2/2
Cycle Time = $2\,T_{EVAL} + T_{DETECT}$
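The three dual-rail cycle-time expressions can be compared side by side with a small calculator. The PS0 expression is not given on the slides; it is inferred here from the stated savings (LP3/1 saves 1 precharge + 1 completion detection, LP2/2 saves 1 evaluation + 1 precharge, and both give the same result). The delay values in the usage example are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def cycle_times(t_eval: float, t_detect: float, t_prech: float) -> dict:
    """Analytical cycle times of the dual-rail designs, in the same
    unit as the inputs."""
    return {
        # Inferred from the stated savings, not quoted from the slides.
        "PS0": 3 * t_eval + 2 * t_detect + t_prech,
        "LP3/1": 3 * t_eval + t_detect,
        "LP2/2": 2 * t_eval + 2 * t_detect,
        "LP2/1": 2 * t_eval + t_detect,
    }


if __name__ == "__main__":
    # Hypothetical delays in arbitrary units.
    for design, t in cycle_times(t_eval=3, t_detect=2, t_prech=3).items():
        print(f"{design}: {t}")
```

With these made-up delays the ordering PS0 > LP3/1 > LP2/2 > LP2/1 matches the ordering of the simulated throughputs, but the relative position of LP3/1 and LP2/2 depends on whether evaluation or detection is slower.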
20
Lookahead Pipelines: Overview
5 New Designs:
“Dual-Rail” Data Signaling:
  LP3/1: “early evaluation”
  LP2/2: “early done”
  LP2/1: “early evaluation” + “early done”
“Single-Rail” Bundled-Data Signaling:
  LPSR2/2: “early done”
  LPSR2/1: “early evaluation” + “early done”
21
Single-Rail Design: LPSR2/1
Derivative of LP2/1, adapted to single-rail:
  bundled-data: matched delays instead of completion detectors
  “Ack” to previous stages is “tapped off early”
    once in evaluate (precharge), dynamic logic is insensitive to input changes
[Figure: three pipeline stages, each with a matched delay element]
22
Inside an LPSR2/1 Stage
“done” generated by an asymmetric C-element:
  done=1 when stage evaluates, and data inputs valid
  done=0 when stage precharges
PC and Eval are combined exactly as in LP3/1
[Figure: stage with data in and data out; “req” in passes through a matched delay to “req” out; a NAND gate merges PC (from stage N+1) and Eval (from stage N+2); an asymmetric C-element (aC, “+”) produces done, which is sent back as “ack”]
23
LPSR2/1 Protocol
Cycle Time = $2\,T_{EVAL} + T_{aC}$
  $T_{aC}$ = delay through asymmetric C-element
[Figure: event sequence (events 1 to 3) across stages N, N+1, N+2: N evaluates; N+1 evaluates; N+1 indicates “done”; N+2 evaluates; N+2 indicates “done”]
24
FIFO Results (simulations)

Design     Signaling     Throughput (Giga items/sec)   Improvement (x over PS0)
PS0        dual-rail     0.51                          1
LP3/1      dual-rail     0.69                          1.3
LP2/2      dual-rail     0.90                          1.8
LP2/1      dual-rail     1.04                          2.0
LPSR2/2    single-rail   1.31                          2.6
LPSR2/1    single-rail   1.55                          3.0
HC         single-rail   1.75                          3.4

LP dual-rail: over 80% faster than Williams’ PS0
  comparable latency
LP single-rail: even faster
0.19 CMOS, 3.3 V, 300 K
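As a quick consistency check, the improvement column can be recomputed as the ratio of each throughput to that of PS0. The sketch below only uses the numbers from the table; the dictionary and function names are hypothetical.

```python
# Throughput in Giga items/sec, copied from the FIFO results table.
THROUGHPUT = {
    "PS0": 0.51, "LP3/1": 0.69, "LP2/2": 0.90, "LP2/1": 1.04,
    "LPSR2/2": 1.31, "LPSR2/1": 1.55, "HC": 1.75,
}

# Improvement factors as listed in the table.
LISTED = {
    "PS0": 1.0, "LP3/1": 1.3, "LP2/2": 1.8, "LP2/1": 2.0,
    "LPSR2/2": 2.6, "LPSR2/1": 3.0, "HC": 3.4,
}


def speedups(throughput: dict, baseline: str = "PS0") -> dict:
    """Throughput of each design relative to the baseline design."""
    base = throughput[baseline]
    return {name: value / base for name, value in throughput.items()}


if __name__ == "__main__":
    for name, ratio in speedups(THROUGHPUT).items():
        print(f"{name:8s} computed {ratio:.2f}  listed {LISTED[name]}")
```

The recomputed ratios agree with the listed factors to within about 0.05, so the column reads as a multiplicative factor over PS0 rather than a percentage. The small differences (for example roughly 1.35 against the listed 1.3 for LP3/1) are presumably rounding of the published figures.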
25
Practicality of Gate-Level Pipelining
When datapath is wide:
  Can often split into narrow “streams”
  Use “localized” completion detector for each stream:
    need to examine only a few bits: small fan-in
    send “done” to only a few gates: small fan-out
  comp. det. fairly low cost!
[Figure: datapath width = 32 dual-rail bits, split into streams; each comp. det. has fan-in = 2 and its “done” has fan-out = 2]
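A minimal sketch of localized completion detection, assuming each dual-rail bit is a `(rail0, rail1)` pair that is valid when exactly one rail is high, and assuming a hypothetical stream width of 2 bits. The function name and encoding are illustrative only.

```python
def stream_done_signals(bits, stream_width: int = 2):
    """Split a wide dual-rail datapath into narrow streams and return
    one local "done" per stream.

    Each detector examines only `stream_width` bits, so its fan-in stays
    small no matter how wide the whole datapath is.
    """
    dones = []
    for start in range(0, len(bits), stream_width):
        stream = bits[start:start + stream_width]
        # A dual-rail bit carries data when exactly one rail is high;
        # (0, 0) is a spacer.
        dones.append(all(r0 != r1 for r0, r1 in stream))
    return dones


if __name__ == "__main__":
    data = [(1, 0), (0, 1), (0, 0), (1, 0)]  # third bit is still a spacer
    print(stream_done_signals(data))
```

For a 32-bit datapath this gives 16 independent detectors, each looking at 2 bits, instead of one detector with a fan-in of 32.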
26
High-Capacity Pipelines
Singh/Nowick
WVLSI-00, ISSCC-02, Async-02
27
HC Pipeline Style
High-Capacity Pipelines (HC):
  bundled datapaths; dynamic logic function blocks
  latch-free: no explicit latches needed
    dynamic logic provides implicit latching
  novel highly-concurrent protocol maximizes storage capacity
    traditional latch-free approaches: “spacers” limit capacity to 50%
Key Idea: Obtain greater control of stage’s operation
  separate control of pull-up/pull-down
  result = new “isolate phase”
    stage holds outputs/impervious to input changes
Advantage: Each stage can hold a distinct data item
  100% storage capacity
Extra Benefit: Obtain greater concurrency
  High throughput
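The 50% versus 100% capacity claim can be illustrated with a toy token-flow model. This is a deliberately abstract sketch, not the HC protocol: a stage is just a slot that is empty or holds a token, the pipeline output is assumed to be blocked, and the only difference between the two styles is whether a stage must wait for its successor to be empty (a spacer) before accepting data. The function name and flag are hypothetical.

```python
def fill_blocked_pipeline(num_stages: int, needs_spacer: bool):
    """Keep pushing tokens into a pipeline whose output never drains and
    return the final contents (None = empty stage).

    needs_spacer=True  models spacer-based latch-free styles.
    needs_spacer=False models a style where every stage can hold an item.
    """
    slots = [None] * num_stages
    next_token = 0
    moved = True
    while moved:
        moved = False
        # Sweep from the output end so a token advances one stage per sweep.
        for i in range(num_stages - 1, -1, -1):
            if slots[i] is not None:
                continue
            # Spacer rule: stage i accepts data only if stage i+1 is empty.
            if needs_spacer and i + 1 < num_stages and slots[i + 1] is not None:
                continue
            if i == 0:
                slots[0] = next_token  # the environment supplies a new token
                next_token += 1
                moved = True
            elif slots[i - 1] is not None:
                slots[i], slots[i - 1] = slots[i - 1], None
                moved = True
    return slots


if __name__ == "__main__":
    print(fill_blocked_pipeline(4, needs_spacer=True))
    print(fill_blocked_pipeline(4, needs_spacer=False))
```

With four stages the spacer-based model stalls with two tokens separated by empty stages, while the other model ends with a distinct token in every stage.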
28
HC: Basic Structure
Key Idea: 2 independent control signals:
  pc: controls precharge
  eval: controls evaluation
Allows novel 3-phase cycle:
  Evaluate
  “Isolate” (hold)
  Precharge
Single-rail “Bundled Datapath”:
  matched delay: produces delayed “done” signal
  worst-case delay: longer than slowest path for data
[Figure: stages N, N+1, N+2; each has a stage controller driving pc and eval, a matched delay element, and an ack to the previous stage]
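The matched-delay requirement amounts to a one-line check. The sketch below assumes path delays are known in picoseconds and adds a hypothetical safety margin parameter; neither the margin nor the numbers in the usage example come from the slides.

```python
def matched_delay_ok(path_delays_ps, matched_delay_ps: float,
                     margin: float = 0.0) -> bool:
    """True if the matched delay is at least as long as the slowest data
    path, optionally inflated by a fractional safety margin."""
    worst_case = max(path_delays_ps)
    return matched_delay_ps >= worst_case * (1.0 + margin)


if __name__ == "__main__":
    paths = [80.0, 95.0, 100.0]  # hypothetical data-path delays in ps
    print(matched_delay_ok(paths, 115.0, margin=0.1))
    print(matched_delay_ok(paths, 105.0, margin=0.1))
```

If the check fails, “done” could be signalled before the slowest data bit has settled, which is exactly what bundled-data signaling must rule out.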
29
HC: Inside a Stage
Independent Controls of pull-up and pull-down:
  allows new 3rd phase: “isolate”
  pc asserted: precharge
  eval asserted: evaluate
  pc and eval de-asserted: enter “isolate” (hold) phase
[Figure: dynamic gate with inputs, outputs and a “keeper”; pc controls precharge, eval controls evaluation]
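The three phases can be mimicked with a small behavioural model of a dynamic gate. This is a sketch under stated assumptions: the controls are abstract booleans (`pc_asserted`, `eval_asserted`) rather than real voltage levels, the output is taken after the gate's output inverter so that precharge gives 0, and the class name is hypothetical.

```python
class HCDynamicGate:
    """Dynamic gate with independently controlled precharge and evaluate."""

    def __init__(self, func) -> None:
        self.func = func   # logic function computed by the pull-down stack
        self.out = False   # output value, precharged to 0

    def step(self, pc_asserted: bool, eval_asserted: bool, inputs) -> bool:
        if pc_asserted and eval_asserted:
            raise ValueError("pull-up and pull-down must not fight")
        if pc_asserted:
            self.out = False                      # precharge: erase the item
        elif eval_asserted:
            # Evaluate: the output may only rise (monotonic dynamic logic).
            self.out = self.out or bool(self.func(*inputs))
        # Both de-asserted: isolate. The keeper holds the output and the
        # gate ignores its inputs.
        return self.out


if __name__ == "__main__":
    gate = HCDynamicGate(lambda a, b: a and b)
    print(gate.step(True, False, (1, 1)))   # precharge
    print(gate.step(False, True, (1, 1)))   # evaluate
    print(gate.step(False, False, (0, 0)))  # isolate: inputs change, output held
```

The isolate step is what lets the previous stage remove its data immediately: the gate keeps its result even though its inputs have gone away.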
30
HC: Protocol
Most Existing Protocols: 3 synchronization arcs
  1 forward arc: data dependency
  2 backward arcs: control synchronization
Our protocol: only 2 synchronization arcs
  only 1 backward arc
  once stage N+1 evaluates, N can complete entire next cycle!
[Figure: Stage N and Stage N+1 each cycle through Eval (pc=1, eval=1), Isolate (pc=1, eval=0) and Precharge (pc=0, eval=0); synchronization arcs run between the two stages, with one arc marked X]
31
Formal Specification of Controller
Problem: Specification too concurrent for direct synthesis
  desired precharge condition: N and N+1 have evaluated same data
  problem: this condition not uniquely captured by given signals!
    N may evaluate next data item, while N+1 stuck on current item!
[Figure: signal transition graph over pc+, eval+, S+, eval-, pc-, S-, annotated (Start evaluate), (Evaluate complete), (Isolate), (Start precharge), (Precharge complete); inputs from the next stage are T+ (Evaluate of N+1 complete) and T- (Precharge of N+1 complete)]
32
Modified Specification of Controller
Solution: Add a state variable ok2pc
  ok2pc records whether N+1 has “absorbed” N’s data item
  ok2pc resets immediately when N deletes item (N precharges)
  ok2pc is set when N+1 deletes item (N+1 precharges)
[Figure: signal transition graph over ok2pc+, ok2pc-, pc+, eval+, S+, eval-, pc-, S-; inputs from the next stage are T+ (Evaluate of N+1 complete) and T- (Precharge of N+1 complete)]
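The role of ok2pc can be seen in an event-level sketch. The set and reset rules below follow the bullets above; the precharge condition (S and T and ok2pc) and the initial state are assumptions made for illustration, and the class and method names are hypothetical.

```python
class Ok2pcController:
    """Event-level model of stage N's precharge decision.

    S: stage N has evaluated.  T: stage N+1 has evaluated.
    """

    def __init__(self) -> None:
        self.S = False
        self.T = False
        self.ok2pc = True  # assumed initial state: N+1 is precharged (empty)

    def n_evaluated(self) -> None:
        self.S = True

    def n_precharged(self) -> None:
        self.S = False
        self.ok2pc = False  # reset as soon as N deletes its item

    def n1_evaluated(self) -> None:
        self.T = True

    def n1_precharged(self) -> None:
        self.T = False
        self.ok2pc = True   # set when N+1 deletes its item

    def may_precharge(self) -> bool:
        # Assumed condition: both stages evaluated, and N+1's evaluation
        # refers to N's current item rather than the previous one.
        return self.S and self.T and self.ok2pc


if __name__ == "__main__":
    c = Ok2pcController()
    c.n_evaluated(); c.n1_evaluated()
    print(c.may_precharge())   # N+1 has absorbed item 1
    c.n_precharged(); c.n_evaluated()
    print(c.may_precharge())   # N holds item 2, N+1 still stuck on item 1
```

In the second query S and T are both high, so those two signals alone would wrongly allow a precharge; ok2pc is what distinguishes the two situations.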
33
Controller implementation
Controller implementation is very simple:
  each signal implemented using a single gate
  ok2pc typically off the critical path
[Figure: controller built from an INV, a NAND3 and an asymmetric C-element (aC, “+”), with signals S, T, ok2pc, pc and eval]
34
HC: Stage Implementation
[Figure: stage with pc and eval controls, req passing through a matched delay to done, and ack to the previous stage; the controller uses a NAND, an INV and an asymmetric C-element (“+”). Annotations:
  state variable: off the critical path
  from current stage
  self-loop: key to fast “isolation”
  from next stage
  early ack]
35
HC: Operation
[Figure: event sequence (events 1 to 3) between stages N and N+1: N evaluates; N+1 starts to evaluate; N isolates; N precharges; N enables itself for next evaluation. Annotations: (fast self-loop), shown twice, and (early Ack)]
Cycle Time = 8 CMOS gate delays
36
Performance
Cycle Time = $T_{EVAL} + T_{aC} + T_{NAND3} + T_{PRECH} + T_{INV}$
[Figure: event sequence (events 1 to 3) across stages N, N+1, N+2: N evaluates; N+1 evaluates; N isolates; N precharges; N enables itself for next evaluation]
37
FIFO Results (simulations)

Design     Signaling     Throughput (Giga items/sec)   Improvement (x over PS0)
PS0        dual-rail     0.51                          1
LP3/1      dual-rail     0.69                          1.3
LP2/2      dual-rail     0.90                          1.8
LP2/1      dual-rail     1.04                          2.0
LPSR2/2    single-rail   1.31                          2.6
LPSR2/1    single-rail   1.55                          3.0
HC         single-rail   1.75                          3.4

LP dual-rail: over 80% faster than Williams’ PS0
  comparable latency
LP single-rail: even faster
0.19 CMOS, 3.3 V, 300 K
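The simulated throughputs can be turned back into cycle times, assuming one data item is delivered per cycle. The per-gate figure in the last line additionally assumes that the 8-gate-delay cycle quoted for HC applies to this simulated FIFO, which the slides do not state explicitly.

```latex
\begin{align*}
T_{cycle} &= \frac{1}{\text{throughput}} \\
T_{PS0}   &\approx \frac{1}{0.51 \times 10^{9}\ \text{items/s}} \approx 1.96\ \text{ns} \\
T_{HC}    &\approx \frac{1}{1.75 \times 10^{9}\ \text{items/s}} \approx 0.57\ \text{ns} \\
\frac{T_{HC}}{8\ \text{gate delays}} &\approx 71\ \text{ps per gate delay}
\end{align*}
```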