IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)
SYSTEMS, VOL. 22, NO. 7, JULY 2014 1593
Novel Class of Energy-Efficient Very High-Speed Conditional
Push-Pull Pulsed Latches
Elio Consoli, Gaetano Palumbo, Fellow, IEEE, Jan M. Rabaey,
Fellow, IEEE, and Massimo Alioto, Senior Member, IEEE
Abstract: In this paper, a new class of pulsed latches is
introduced and experimentally assessed in 65-nm CMOS. Its
conditional push-pull pulsed latch topology is based on a
push-pull final stage driven by two split paths with a
conditional pulse generator. Two circuit implementations of the
concept are discussed, with their main difference being in the
pulse generator, which can be either shared (CSP3L) or not
(CP3L). Measurements show that the proposed topology is very fast,
as it outperforms the well-known transmission gate pulsed latch
(TGPL) [1] by 1.5–2×; hence the proposed pulsed latch has the
highest performance ever reported. The proposed pulsed latch is
also shown to significantly improve the energy efficiency compared
to the state of the art. Indeed, a 2.3× improvement in ED^3 product
(energy × delay^3) over TGPL was found for designs targeting
minimum ED^3. For designs targeting minimum ED, a 1.3× improvement
was found in ED product. This comes at the cost of a 1.15–1.35×
cell area penalty, which translates into an overall area increase
well below 1% in typical systems. Measurements on 256 replicas
confirm that the above benefits are kept in the presence of
variations. Accordingly, the proposed class of pulsed latches goes
beyond the current state of the art and is well suited for VLSI
systems that require both high performance and energy efficiency.
Index Terms: Clocking, energy efficiency, energy-delay tradeoff,
flip-flops (FFs), high speed, low power, nanometer CMOS, pulsed
latches, VLSI.
I. INTRODUCTION
FLIP-FLOPS (FFs) and latches are well known to be responsible for
a large fraction of the power budget of microprocessors and VLSI
systems [1]–[7]. Typically, they dissipate 80% of the total clock
power [5], and 30% of the overall power budget [2]. The energy
efficiency of FFs and latches is nowadays even more critical than
in the past, considering that speed can be increased only through
improvements in energy efficiency, since VLSI systems are power
limited [2], [8], [9]. Therefore, the search for novel topologies
with a
Manuscript received February 12, 2013; revised July 11, 2013;
accepted July 28, 2013. Date of publication September 9, 2013; date
of current version June 23, 2014.
E. Consoli is with Maxim Integrated Products, Catania 92100,
Italy (e-mail: [email protected]).
G. Palumbo is with the DIEEI, Università di Catania, Catania
I-95125, Italy (e-mail: [email protected]).
J. M. Rabaey is with the Electrical Engineering and Computer
Science Department, University of California, Berkeley, CA 94720 USA
(e-mail: [email protected]).
M. Alioto is with the Electronics and Computer Engineering
Department, National University of Singapore, 117576 Singapore
(e-mail: [email protected]).
Color versions of one or more of the figures in this paper are
available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2013.2276100
Fig. 1. Pareto-optimal energy-delay curve of existing FF
topologies for a typical load of 16 minimum inverters (energy per
cycle and D-Q delay are in arbitrary units).
Fig. 2. (a) TGPL topology. (b) Pulse generator topology (area in
dashed line is shareable among multiple cells).
targeted speed under a relatively low consumption (with their
tradeoff quantified by composite E^i D^j metrics [10]–[12]) is
crucial.
Among state-of-the-art topologies, pulsed latches typically
exhibit the best energy efficiency among the existing classes of
FFs for moderate- to high-performance design targets [10]–[15]. In
particular, from moderate to very high performance targets, only
very few topologies belong to the Pareto-optimal curve of designs
having minimum energy for a given performance [10], [11]. As
recalled in Fig. 1, the transmission gate pulsed latch (TGPL) [1]
(see Fig. 2) used in various Intel microprocessors is the most
energy-efficient FF in a rather wide portion of the Pareto-optimal
curve, ranging from high-speed (i.e., points with minimum E D^j
product with j > 1) to energy-efficient designs (i.e., points with
minimum ED). Only the skew-tolerant FF (STFF) is able
1063-8210 © 2013 IEEE. Personal use is permitted, but
republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html
for more information.
to outperform TGPL for extremely high-speed design targets [16]
(i.e., points with minimum ED^j for j ≥ 5). In this region, the
STFF speed advantage in terms of D-Q delay is typically about 10%,
at the cost of a 2× greater energy [11]. Although STFF is slightly
better than TGPL in terms of pure performance, its significantly
worse energy efficiency makes it less competitive than TGPL in
applications where energy efficiency is a concern. Hence, in the
following, TGPL will be adopted as a reference for high-speed
energy-efficient designs. When slower design targets are
considered, master-slave FFs exhibit better energy efficiency. The
traditional TGFF [17] and the recently proposed Toshiba ACFF [18]
are, respectively, the most efficient among designs with balanced
energy-delay (i.e., minimum ED) and ultralow energy designs (i.e.,
minimum E^j D with j > 1).
In this paper, a novel class of pulsed latches (conditional
push-pull pulsed latch) is introduced. The main idea is to adopt a
push-pull output stage, which is driven by two split paths for rise
and fall output transitions, with the explicit aim of reducing both
the path effort and the parasitic delay [19]. In addition, the
capacitance at the output of the first stage is further reduced by
adopting half-latches in the split paths and moving the
cross-coupled inverters to the output node.
Two versions are presented, respectively, without (CP3L) and with
(CSP3L) a shareable conditional pulse generator. Measurements on a
65-nm test chip demonstrate 1.3–2.3× better energy efficiency
compared to TGPL, as well as a 1.5–2× D-Q delay improvement even in
the presence of process variations. The proposed pulsed latches
have a 1.15–1.35× larger area than TGPL, with a resulting increase
in the area of practical VLSI systems that is well below 1%.
This paper is organized as follows. In Section II, the rationale
behind the proposed novel topologies and their operation is
described, and their detailed circuit implementation is discussed
in Section III. The potential speed advantage compared to TGPL is
analytically evaluated in Section IV, and aspects related to
physical design and layout parasitics are discussed in Section V.
Details on the 65-nm test chip and the adopted delay-energy testing
circuitry are provided in Section VI. Measurement results and
comparison with state-of-the-art topologies are discussed in
Section VII. Conclusions are reported in Section VIII, and an
Appendix presents a detailed logical effort analysis.
II. CONDITIONAL PUSH-PULL PULSED LATCH: MAIN IDEAS AND
OPERATION
In the proposed class of pulsed latches shown in Fig. 3, a
push-pull output stage is adopted (M7–M8) as opposed to the
traditional output inverter stage employed in most existing
topologies [10]–[18] (see M5–M6 in TGPL in Fig. 2). Such a
technique reduces the load of the driving circuitry by a factor of
2–3, thereby making it faster and more energy-efficient. This also
allows M7–M8 in Fig. 3 to be up-sized, and hence yields a faster
output stage.
The push-pull output stage in Fig. 3 is driven by two split paths
that generate the active-high reset R (active-low set S) pulsed
signal, which resets (sets) the output when active. Pulses R and S
are alternately generated to enable a fall/rise output transition,
respectively. These pulses are generated at the falling clock edge
by the conditional pulse generator in Fig. 3, and are transferred
to the output stage by either the half latch M1–M3 or M4–M6,
depending on whether input D is, respectively, low or high (see
below for a detailed description of the pulse waveforms). These
half latches in the first stage within the D-Q critical path have
fewer parasitics than typical clocked inverters or inverters with a
cascaded transmission gate [10]–[18] (see M1–M4 in Fig. 2). The
input D drives two different paths, respectively, through an nMOS
(M5) and a pMOS (M2) transistor in Fig. 3, which is equivalent to
the load of a traditional input inverter stage (see M1–M2 in TGPL
in Fig. 2).
Fig. 3. General scheme of the proposed class of pulsed latches.
The operation of the scheme in Fig. 3 is explained in detail in
Fig. 4, which depicts the main waveforms of the internal signals.
After the falling clock edge (cycle 1 in Fig. 4), the pulse
generator checks if the previous output QD (see footnote 1) in
Fig. 3 is high or low. If the previous output is QD = 1, the next
output Q can stay at the same value or make a falling transition;
hence a pulse is generated in the fall path in Fig. 3 through the
active-low signal CPf, whereas nothing changes in the rise path
(active-high signal CPr is kept low, thus latch M4–M6 keeps S high
and maintains M8 OFF). Subsequently, if the input stays at the
previous value D = 1, the latch M1–M3 is not enabled; hence R is
dynamically kept at the previous value R = 0 (then, it is
statically tied to ground once the pulse expires). On the other
hand, if the input changes to D = 0, the latch M1–M3 is enabled and
the CPf pulse determines a high pulse in R, which turns M7 ON and
brings the output Q low. Afterwards, its delayed output replica QD
experiences the same transition.
If the previous output is QD = 0, right after the falling clock
edge (cycle 2 in Fig. 4), a pulse is generated in the rise path
through the active-high signal CPr (nothing changes in the fall
path). If the input stays at the previous value D = 0, the latch
M4–M6 is disabled and S is kept high, so that nothing changes in
the rise path. If the input changes to D = 1, the latch M4–M6 is
enabled and the CPr pulse pulls down S, thereby turning M8
1 More precisely, the delayed version QD of the output Q is fed
back to the conditional pulse generator. As explained below,
feeding back QD (rather than Q) reduces the internal activity and
hence the energy per cycle.
CONSOLI et al.: ENERGY-EFFICIENT VERY HIGH-SPEED CONDITIONAL
PUSHPULL PULSED LATCHES 1595
Fig. 4. Waveforms of internal signals of the general scheme in
Fig. 3.
Fig. 5. CP3L topology (area in dashed line is shareable among
multiple cells).
ON and bringing Q high. Afterwards, the delayed output replica QD
experiences the same transition.
At the steady state, R (S) in Fig. 3 is set to 0 (1), thereby
turning OFF the output transistors M7–M8, with the output being
maintained at the desired value by a keeper. In other words, the
memory element within the proposed topology in Fig. 3 is actually
placed at the output node, as opposed to most of the existing
topologies where it is placed before the output stage (see the
gated cross-coupled inverter pair in Fig. 2, which is connected to
the input of the output stage M5–M6). This moves the parasitics
associated with the memory element to the output node, thereby
making the input node of the output stage lightly loaded, and hence
faster and more energy efficient.
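The cycle-by-cycle behavior described in this section can be summarized with a small behavioral model. The following Python sketch is only an illustration under simplifying assumptions (zero delays, ideal pulses, QD equal to the previous output during the whole pulse); the function name `latch_cycle` and its return format are hypothetical and do not come from the paper.

```python
def latch_cycle(q_prev: int, d: int) -> dict:
    """Behavioral (zero-delay) model of one clock cycle of the scheme in Fig. 3."""
    qd = q_prev  # delayed replica QD still holds the previous output during the pulse
    cpf_active = (qd == 1)  # fall-path pulse (CPf goes low) only when QD = 1
    cpr_active = (qd == 0)  # rise-path pulse (CPr goes high) only when QD = 0
    # Half latch M1-M3 passes the fall pulse only when D = 0: R pulses high.
    r = 1 if (cpf_active and d == 0) else 0
    # Half latch M4-M6 passes the rise pulse only when D = 1: S pulses low.
    s = 0 if (cpr_active and d == 1) else 1
    if r == 1:
        q = 0        # M7 pulls the output down
    elif s == 0:
        q = 1        # M8 pulls the output up
    else:
        q = q_prev   # M7 and M8 are OFF; the keeper holds the output
    return {"R": r, "S": s, "Q": q}


# Enumerate the four (previous output, input) combinations.
for q_prev in (0, 1):
    for d in (0, 1):
        print(q_prev, d, latch_cycle(q_prev, d))
```

In this simplified model, R or S is asserted only in the cycles where the output actually has to change, which mirrors the conditional nature of the pulse generation described above.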
III. IMPLEMENTATION OF THE CONDITIONAL PUSH-PULL PULSED LATCH
CONCEPT: CP3L AND CSP3L TOPOLOGIES
As discussed above, the proposed class of pulsed latches in Fig. 3
tends to have a lightly loaded D-Q critical path, thereby making it
potentially fast and energy-efficient. Such features can be
implemented in different ways. In the following, we present two
versions, respectively, without (Section III-A) and with
(Section III-B) a shareable pulse generator.
A. CP3L: Conditional Push-Pull Pulsed Latch
The schematic of the CP3L topology is depicted in Fig. 5. The
keeper (M9–M12 in Fig. 5) drives the output Q and comprises a
cross-coupled inverter pair, whose forward inverter is gated to
avoid current contention with the output stage M7–M8. Indeed, if
R = 1 the pull-down M7 of the output stage is ON and the pull-up
network of the keeper is OFF through M11. Analogously, if S = 0 the
pull-up M8 of the output stage is ON and the pull-down network of
the keeper is OFF through M10.
As an additional advantage brought by placing the keeper after
the output stage rather than before, CP3L has a lighter load on its
critical path, since the half latch M1–M3 (M4–M6) in the first
stage has to drive a single keeper transistor, M11 (M10). Also,
since the two pulses R and S are alternately generated, only one of
M10 and M11 in the keeper is actually subject to a transition of
its gate terminal in a given cycle. In contrast, the first stage of
traditional topologies must drive two transistors associated with
the keeper, and both of them are subject to transitions [10]–[18]
(see transistors M11–M12 in Fig. 2, which load transistors M3–M4
lying in the critical path). This clearly reduces the parasitic
load of the first stage of CP3L and reduces activity at the keeper
capacitances, thereby making the first stage faster and potentially
more energy efficient.
Regarding the pulse generator, it comprises a clock phase
generator, a pseudo-NAND gate for the fall path (M15–M19 in
Fig. 5), and a pseudo-NOR gate for the rise path (M20–M24).
Operation is summarized in Fig. 6, which depicts the waveforms of
the signals involved in the generation of the CPf and CPr pulses.
Generally, the pseudo-NAND (pseudo-NOR) gate sets signal CPf (CPr)
high (low), since signals CKn(I) and CK(IV) (CK and CKn(III)) are
complementary and thus keep either transistor M18 or M19 (M20 or
M21) ON. However, after the falling clock edge, signals CKn(I) and
CK(IV) (CK and CKn(III)) are both temporarily high (low) due to the
transitions of the four inverters within the clock phase generator
(in the example in Fig. 6, each inverter is assumed to have the
same delay τinv for simplicity). Accordingly, during the time slot
τinv–4τinv in Fig. 6, the pseudo-NAND temporarily sets CPf low
through transistors M15–M17 if QD = 1 (otherwise, CPf remains
high). Similarly, during the time slot 0–3τinv in Fig. 6, the
pseudo-NOR temporarily sets CPr high through transistors M22–M24 if
QD = 0 (otherwise, CPr remains low). Hence,
Fig. 6. Clock phase generator and waveforms defining CPr and CPf
pulses.
the clock phase generator and the pseudo-NAND/NOR gates implement
a conditional pulse generator, which alternately produces a pulse
on either CPf or CPr, as determined by the previous output value
QD. The clock phase generator can be shared among multiple latches
to amortize its overhead.
It is useful to observe that the width of the CPf and CPr pulses
determines the width of the transparency window of the CP3L latch,
in which the input can affect the output. From a design point of
view, the width of the transparency window can be modified by
changing the delay of the inverters within the clock phase
generator in Fig. 5. The effect of process variations on timing can
be compensated through post-silicon tuning of the pulse width,
possibly sharing the tuning circuitry among multiple latches [1],
[20], [21]. In this paper, no tunability is added to the considered
pulsed latches, since the addition of such a feature would impact
the area/energy of any pulsed latch equally. Indeed, almost all
existing pulsed latches adopt the same pulse generator topology
(e.g., cascaded inverters as in Figs. 2, 5, and 6) [10], [11].
The delay stage in the feedback path in Figs. 3–5 generates a
delayed replica QD of the output Q, and is implemented by the two
inverters M13–M14 and M25–M26 in Fig. 5. Actually, only the slow
transistors M25–M26 are added to implement such delay, as the
inverter M13–M14 is already available (i.e., M13–M14 are used to
both latch and delay the output). This delay stage makes sure that
QD is kept stable at its previous value during the transparency
window, thereby preventing glitches in CPr and CPf and reducing
dynamic energy, as discussed in the following.
Without the delay stage, the output Q would be connected directly
to the pseudo-NAND/NOR in Fig. 5; hence any output transition
within the transparency window would immediately trigger the
generation of an additional (undesired) pulse. As shown in detail
in Fig. 7, which refers to the case where Q is directly connected
to the pseudo-NAND/NOR, a falling transition of Q following the
same input transition immediately triggers a high pulse in CPr, as
the pseudo-NOR in Fig. 5 temporarily has all pMOS transistors
M22–M24 ON during the transparency window (i.e., the CPr time slot
in Fig. 6). Observe that this glitch in the CPr pulse increases the
dynamic energy, but it does not affect correct operation. Indeed,
if the previous output was Q = 1 and the current input is D = 0 as
in Fig. 7, the CPr glitch cannot propagate through
Fig. 7. Glitch in CPr occurring if no delay stage is inserted in
the feedback path in Figs. 3–5.
the half latch M4–M6 since M5 is OFF. On the other hand, if the
previous output was Q = 1 and the current input is D = 1, the CPr
glitch propagates through the half latch M4–M6 and temporarily sets
S = 0, but it does not affect the output anyway, since the latter
is kept at the desired value Q = 1 through M8. Dual considerations
hold for glitches in CPf when no delay stage is inserted. As a
result, the delay stage in Figs. 3–5 is not strictly necessary, but
its insertion reduces the activity in CPr and CPf and hence energy.
B. CSP3L: Conditional Shareable Push-Pull Pulsed Latch
In CP3L, the pulse generator cannot be shared among multiple
latches, since the pseudo-NOR/NAND gates are driven by QD, which is
different for each latch. In this subsection, we present a
different implementation of the same concept by integrating the
conditional logic in the latch so that the whole pulse generator
can be shared. The resulting conditional shareable push-pull pulsed
latch (CSP3L) topology is depicted in Fig. 8.
In CSP3L, static NAND/NOR gates are introduced in the shareable
pulse generator to generate the pulses CPf,ext and CPr,ext, which
are distributed to multiple latches and have the same role as CPf
and CPr had in CP3L. In each latch, such external pulses are
enabled through the switches implemented by M15–M22 in Fig. 8,
which implement the conditional pulse selection logic. The latter
comprises two transmission gates and two small keepers to maintain
the same operation as before. As discussed above, the delay stage
M23–M26 is introduced in the feedback path (two more transistors
than in CP3L, since the transmission gates need complementary
control signals). The resulting transistor count is the same as
CP3L; hence the CSP3L area is expected to be roughly the same as
CP3L (excluding the shareable part).
Since CSP3L is based on the same concept as CP3L, its operation
is very similar. The main difference is in the conditional pulse
selection logic, which enables the propagation of either CPf,ext or
CPr,ext to the half latches, according to the value of the delayed
output replica QD. In particular, if QD = 1 (QD = 0) the fall
(rise) path is activated, as the transmission gate M15–M16
(M19–M20) transfers the CPf,ext (CPr,ext)
Fig. 8. CSP3L topology (area in dashed line is shareable among
multiple cells).
pulse to the input of the half latch M1–M3 (M4–M6), similar to the
pseudo-NAND (pseudo-NOR) of CP3L in Fig. 5. As a minor difference
from CP3L, the input capacitance seen from CPf,ext and CPr,ext in
CSP3L depends on Q, which may lead to data-dependent clock skew
(see Fig. 8). In practical cases, this is not a concern,
considering that pulsed latches inherently tolerate a significant
amount of skew.
IV. ANALYSIS OF SPEED POTENTIAL
In this section, CP3L and CSP3L are compared to TGPL in terms of
maximum achievable performance through logical effort analysis
[22]. According to the analysis under the assumptions in the
Appendix, the minimum D-Q delay normalized to the
technology-dependent constant τ [22] for the CP3L, CSP3L, and TGPL
topology is
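$$D_{min,CP3L} \approx D_{min,CSP3L} \approx \frac{4}{3}\,\frac{C_L}{C_{in}} + \frac{5}{3} \tag{1}$$
$$D_{min,TGPL} \approx \frac{5}{3}\,\frac{C_L}{C_{in}} + \frac{34}{9} \tag{2}$$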
where CL and Cin are, respectively, the load and the input
capacitance of the pulsed latch. From (1) and (2), CP3L and CSP3L
have basically the same minimum D-Q delay, as is expected by
considering that they have the same D-Q critical path (M1–M8 in
Figs. 5 and 8).
From (1) and (2), CP3L and CSP3L are always faster than TGPL.
Their theoretical maximum speed advantage is about 2.3× and is
obtained at light loads (i.e., electrical effort CL/Cin ≪ 1). For
typical electrical efforts ranging from 10 to 30, the potential
speed advantage is 1.4–1.5×, and decreases to 1.3× for 60 or more.
Although this analysis does not account for wire parasitics, which
will be included in the next section, it suggests that the
potential advantage of CP3L and CSP3L over TGPL typically ranges
from 1.4× to 2×.
The above speed improvement is justified by the lighter load of
the stages lying in the critical path, as was discussed in detail
in the previous section. The logical effort analysis in the
Appendix quantifies the advantages of CP3L and CSP3L in each
critical path stage. Comparison of (A1)–(A5) and (A2)–(A7) clearly
shows that CP3L and CSP3L have a speed advantage over TGPL both in
the first and second stage. In particular, the first stage has
1.25× lower logical effort and 2× lower parasitic delay thanks to
the lighter loading effect of parasitics, compared to TGPL. In
addition, the second stage has 1.5× lower logical effort thanks to
the push-pull configuration.
The potential performance improvement enabled by CP3L and CSP3L
is kept also in the presence of layout parasitics, as will be
discussed in the next section, and can be traded off for
significantly lower energy at iso-performance, as will be
demonstrated in Section VII.
V. LAYOUT-AWARE SIZING METHODOLOGY AND PHYSICAL-LEVEL
CONSIDERATIONS
Although the analysis in the previous section shows a clear
advantage of CP3L and CSP3L in terms of maximum performance, in
practical cases transistors are optimized to have a reasonable
balance with energy. To explore the energy-delay tradeoff in
meaningful design cases, the pulsed latches were sized to minimize
the energy-delay ED^j figure of merit [10], [11]. The resulting
designs are energy optimal, in the sense that they belong to the
Pareto-optimal energy-delay curve [23]. In the following, we focus
on the design points with minimum ED and ED^3, which are,
respectively, representative of applications targeting balanced
energy-delay and high speed [10], [11].
Layout parasitics are well known to be comparable to (or
dominate) device parasitics even in internal nodes of standard
cells, and hence they have a considerable impact on the optimal
transistor sizes that minimize energy for a given performance
constraint [24]. However, most of the previous work on FF/latch
sizing and comparison presents sub-optimal designs in terms of
energy efficiency, as they include layout parasitics only after
the transistors are actually sized [13], [17], [25]–[28]. In
contrast, in this paper, layout parasitics are explicitly included
in the circuit design loop by resorting to the layout-aware sizing
methodology proposed by the same authors in [23]. In short, a
preliminary layout organization was set up in the form of a stick
diagram, then layout parasitics were estimated from the stick
diagram as a function of transistor sizes. Subsequently, transistor
sizes were optimized for the targeted energy-delay figure of merit
by including the estimated layout parasitics in the optimization.
Fig. 9(a)–(c) shows the layout of a TGPL, CP3L, and CSP3L for a
minimum energy-delay target and a load of 16 minimum-sized
inverters. In these figures, the dashed line defines the
Fig. 9. Layout under sizing for minimum ED. (a) TGPL. (b) CP3L.
(c) CSP3L. (Area in dashed line is shareable among multiple cells.)
TABLE I: AREA COMPARISON (65 nm, STD CELL HEIGHT: 3.9 μm)
portion that can be shared and amortized among multiple latches
(i.e., the pulse generator). These figures show that the shareable
portion occupies a significant fraction of the overall latch area.
According to the data in Table I, TGPL is confirmed to have a very
low area, as is well known from the comparison with other existing
topologies [10], [11]. In particular, TGPL offers a 1.15× (1.28×)
area reduction over CP3L (CSP3L) when the pulse generator is not
shared. When the latter is shared, TGPL has a 1.33× (1.35×) area
reduction over CP3L (CSP3L), which is similar to the advantage that
TGPL has compared to most of the area-efficient existing topologies
[10], [11]. As a numerical example, in typical microprocessor
systems, latches typically occupy less than 5% of the total area
(see footnote 2); hence the adoption of CP3L (CSP3L) as a
replacement of all FFs would lead to a 1.6% (1.7%) area increase
compared to TGPL. In practical cases, the area increase resulting
from the adoption of CP3L or CSP3L is well below 1%, since more
traditional topologies are typically used in noncritical paths
[28].
2 This estimate is based on the consideration that the cache
memory takes up more than 50% of the area of today's processors
[29], and flip-flops/latches typically account for 1% of the total
gate count of a core [30], with their size being in the same order
of magnitude as other cells.
Fig. 10. Architecture of the test chip.
Finally, post-layout parasitic extraction on different design
points showed that the intracell wire parasitics at the output of
the first and second stage of CP3L and CSP3L are very similar
(within a few percent) to those of TGPL for a given energy-delay
target. In other words, the qualitative energy-performance
advantages expected from Section III and the quantitative
performance benefits estimated in Section IV are expected to hold
in practical cases where layout parasitics are included (details
will be provided in Section VII).
VI. ON-CHIP TESTING HARNESS
A test chip in 65-nm CMOS was prototyped to validate the proposed
class of pulsed latches. The test chip contains CP3L, CSP3L, and
TGPL latches implemented in two versions, respectively, targeting
minimum-ED and minimum-ED^3 design, with a 16× and 64× load (1× is
equivalent to the input capacitance of a minimum-sized symmetric
inverter). Each latch version was implemented in 64 replicas. To
the best of our knowledge, this is the first paper that validates
novel latch topologies through measurements of multiple replicas.
The architecture of the test chip is shown in Fig. 10. A scan
chain (SC) scans in the test settings and scans out measurement
data. A test controller (TC) manages and applies settings
throughout the latch arrays and steers measurement data. For each
measurement, TC triggers the generation of a pulse with appropriate
width through the pulse generator, and the delay generator (DG)
generates delayed versions of test signals, whose delay can be
tuned with a 1.8-ps step (its use is discussed below). Each latch
(test unit in Fig. 10) is excited after a preliminary coarse
selection to reduce the number of cells switching at the same time,
and then a final selection to steer measurements to SC. The testing
harness measures timing parameters and energy of the above latches.
Two different arrays are used to measure energy and delay, so that
the reconfigurability (i.e., parasitics) added to better measure
one of the two parameters does not interfere with the other. The
die photo is in Fig. 11.
The energy is measured through an external 1-V pin supplying
power to nine replicas. The activity factor can be tuned among the
following values, which widely cover practical applications: 6.25%,
12.5%, 25%, 50%, and 100%. To separate dynamic and leakage energy
components, leakage of a bank of
TABLE II: AVERAGE AND STANDARD DEVIATION OF MAIN PARAMETERS OF
INTEREST (256 REPLICAS, MIN.-ED DESIGN)
TABLE III: COMPARISON WITH STATE OF THE ART
latches is measured through a separate pin. Regarding timing
parameters, the testing harness proposed in [31] has been
implemented to measure the D-CK, CK-Q, and D-Q delay of each latch
replica. Each delay is measured as the difference of the arrival
times of the involved signals (e.g., D and CK for D-CK delay). The
test unit for timing characterization of a single latch is
schematized in Fig. 12, which shows that D, CK, and Q of the latch
under test are MUXed and captured by a master-slave
(transmission-gate) flip-flop clocked by clock CKMS. The arrival
time of each signal relative to the CKMS edge is measured by
sweeping the tunable delay between it and CKMS, and checking when
the capturing FF starts failing [31] (beyond that point, data is
incorrectly captured for greater
Fig. 11. Die photo.
Fig. 12. Block diagram of test unit for timing
characterization.
delays). For example, D-CK delay can be measured according to the
following steps [31].
1) Select CK through the MUX and sweep the tunable delay between
CK and CKMS. Beyond a certain delay, CK is captured incorrectly.
The delay that was applied immediately before defines the arrival
time of CK relative to CKMS.
2) Repeat the same steps by sweeping the delay between D and
CKMS. This defines the arrival time of D relative to CKMS.
3) Evaluate D-CK delay as the difference between the two arrival
times.
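The three steps above can be sketched in software as follows. This Python snippet is only an illustration: the capture behavior is replaced by a hypothetical ideal threshold model (`make_capture_model`), the edge positions of 91 ps and 55 ps are made-up values, and the sign convention of the final difference is an assumption. Only the 1.8-ps step of the delay generator comes from the text.

```python
from typing import Callable

STEP_PS = 1.8  # tuning step of the delay generator quoted in the text


def arrival_time(captured_ok: Callable[[float], bool], max_steps: int = 1000) -> float:
    """Sweep the tunable delay and return the last delay (ps) before capture fails."""
    last_good = None
    for k in range(max_steps):
        delay = k * STEP_PS
        if not captured_ok(delay):
            break  # beyond this point the capturing FF fails
        last_good = delay
    if last_good is None:
        raise ValueError("capture fails even at zero added delay")
    return last_good


def make_capture_model(edge_ps: float) -> Callable[[float], bool]:
    """Hypothetical ideal capturing FF: correct capture up to edge_ps of added delay."""
    return lambda delay: delay <= edge_ps


t_ck = arrival_time(make_capture_model(91.0))  # step 1: sweep CK versus CKMS
t_d = arrival_time(make_capture_model(55.0))   # step 2: sweep D versus CKMS
d_ck_delay = t_ck - t_d                        # step 3: difference of arrival times
print(f"D-CK delay estimate: {d_ck_delay:.1f} ps")
```

Since each arrival time is quantized to the sweep step, the estimated difference is accurate to within one 1.8-ps step in this idealized model.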
Differently from [31], we characterized an array of replicas by
using a single DG to test all replicas, to enable full-timing
characterization of the pulse generator. To keep layout parasitics
small and have good control of the latch load, each replica has its
own local measurement unit (see Fig. 12) and they are placed next
to each other.
Fig. 13. Setup characteristic: CK-Q and D-Q delay versus CK-D of
latches designed for (a) minimum ED (16× load) and (b) minimum ED^3
(64× load).
VII. MEASUREMENT RESULTS
The above-described testing harness was used to characterize the
CP3L, CSP3L, and TGPL latches.
A. Performance, Hold Time Robustness, and Energy
The measured setup curves (i.e., CK-Q and D-Q delay versus CK-D)
are reported in Fig. 13(a) and (b), respectively, for the designs
targeting minimum ED and ED^3. The measured replica is
representative of the nominal corner in this process. From
Fig. 13(a) and (b), CP3L and CSP3L have very similar minimum D-Q
delay, as expected. The D-Q delay of CP3L (CSP3L) is 17.3 ps
(17.9 ps) for minimum-ED sizing, while it is 15.6 ps (16.1 ps) for
minimum-ED^3. From the same figures, the TGPL latch under the same
conditions achieves 34.6 and 24 ps, respectively. Accordingly, TGPL
is slower than CP3L (CSP3L) by 2.03× (1.92×) for the minimum-ED
design, and by 1.54× (1.47×) for minimum ED^3. This is particularly
interesting, considering that TGPL is well known for being the
fastest existing topology among those with reasonably high energy
efficiency [11] (and only 10% slower than the very fastest).
The hold time of CP3L (CSP3L) is 90.5 ps (99.3 ps) for the
minimum-ED design, and 123.6 ps (130.1 ps) for the minimum-ED^3
design. On the other hand, TGPL has a hold time of 121.9 ps and
171.1 ps for the minimum-ED and minimum-ED^3 design, respectively.
Hence, CP3L and CSP3L have a hold time that is slightly better than
TGPL (by about 1.3×) for both the minimum-ED and minimum-ED^3
design.
Fig. 14. Energy per cycle versus data activity.
The transient energy per cycle ETRAN (i.e., dynamic and
short-circuit) is plotted in Fig. 14 versus data activity. This
figure shows that the energy of CP3L and CSP3L is from 40% to 60%
higher than TGPL, depending on the specific activity. Energy itself
is clearly not representative of energy efficiency, as it should be
evaluated at iso-performance. The energy-delay tradeoff of the
above topologies for the different design targets is depicted in
Fig. 15(a), which shows that the minimum-ED CP3L and CSP3L (which
are once again very close to each other) are even faster and
consume less energy than the minimum-ED^3 TGPL. More
quantitatively, the energy of CP3L, CSP3L, and TGPL for 25% data
activity is, respectively, 42, 41.5, and 26.1 fJ for the minimum-ED
design; hence CP3L and CSP3L exhibit a 1.3× better energy-delay
product compared to TGPL. For the minimum-ED^3 design, the energy
of CP3L, CSP3L, and TGPL is 73.7, 75.7, and 46.1 fJ; hence CP3L and
CSP3L improve ED^3 by 2.3× compared to TGPL. From Fig. 14, similar
or better energy efficiency is expected at other realistic values
of data activity. The energy improvement enabled by CP3L and CSP3L
is intuitively explained by considering that these topologies are
significantly faster than TGPL (see Section IV). Hence, CP3L and
CSP3L tend to have smaller transistor sizes for a given performance
target, which in turn translates into smaller dynamic and leakage
energy compared to TGPL.
Leakage can also be a concern in FFs and latches, for example, in
VLSI systems operating in standby mode while retaining information
in registers and power gating all other gates [32]. The leakage
current under equiprobable inputs for CP3L, CSP3L, and TGPL is 316,
401.6, and 424.6 nA, respectively, for the minimum-ED design. As
shown in Fig. 15(b), this translates into a more favorable
leakage-delay tradeoff, with a 2.7× improvement in the leakage-delay
product. For the minimum-ED³ design, the leakage of CP3L, CSP3L, and
TGPL is 561.7, 685.7, and 832.5 nA, which translates into a 5.4×
improvement in the leakage-delay³ product.
Fig. 15. (a) Energy-delay tradeoff. (b) Leakage-delay tradeoff.
B. Variations and Comparison With the State of the Art
The above measurements were repeated on 256 replicas of each version
of the considered pulsed latches (over four dice). As an example,
Fig. 16 reports the resulting histogram of the D-Q delay for the
CP3L in its minimum-ED and minimum-ED³ versions (CSP3L histograms
are very similar). The variability of these and other parameters of
interest is summarized in Tables II and III for the minimum-ED and
minimum-ED³ designs.
From Tables II and III, CP3L and CSP3L have 1.7× lower standard
deviation of D-Q delay compared to the TGPL in the minimum-ED
design, whereas there is no significant difference in the
minimum-ED³ case. The comparable or smaller variations in CP3L and
CSP3L translate into a 1.4× worse variability σ/μ compared to TGPL,
due to the much lower average delay of CP3L and CSP3L. This
variability difference does not significantly affect the
above-mentioned speed advantage of CP3L and CSP3L. Indeed, the
3-sigma worst-case value of their D-Q delay is better than the TGPL
counterpart by 1.4× to 1.9×, depending on the design target. This is
close to the above results in the nominal corner (1.5–2×). Hence,
CP3L and CSP3L are confirmed to be substantially faster than TGPL in
the presence of variations.
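The interplay between standard deviation, relative variability σ/μ, and the 3-sigma worst case can be illustrated with a small Python sketch. The sample values below are hypothetical and are not taken from Tables II and III; they are chosen only to show how a faster latch can have a lower σ, a slightly higher σ/μ, and still a better worst-case delay. The sample standard deviation from the standard library is used, which is an assumption about the estimator.

```python
from statistics import mean, stdev


def variability_summary(delays_ps):
    """Mean, sample standard deviation, sigma/mu, and mu + 3*sigma."""
    mu = mean(delays_ps)
    sigma = stdev(delays_ps)
    return {
        "mean": mu,
        "sigma": sigma,
        "sigma_over_mu": sigma / mu,
        "worst_3sigma": mu + 3 * sigma,
    }


# Hypothetical D-Q delay samples in ps (illustrative only, not measured data).
fast_latch = variability_summary([15.0, 16.0, 17.0, 18.0, 19.0])
slow_latch = variability_summary([31.0, 33.0, 35.0, 37.0, 39.0])

for name, s in (("fast", fast_latch), ("slow", slow_latch)):
    print(f"{name}: mean={s['mean']:.1f} ps, sigma={s['sigma']:.2f} ps, "
          f"sigma/mu={s['sigma_over_mu']:.3f}, "
          f"3-sigma worst case={s['worst_3sigma']:.1f} ps")
```

In this toy case the fast latch has half the standard deviation of the slow one, yet a marginally larger σ/μ because its mean is much smaller, and its 3-sigma worst case remains well below that of the slow latch.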
Fig. 16. Histogram of CP3L D-Q delay for (a) minimum-ED design and
(b) minimum-ED³ design (256 measurements).
CP3L and CSP3L have approximately the same variability as TGPL in
regard to setup time and leakage, from Tables II and III. On the
other hand, CP3L and CSP3L have similar or 2× worse variability of
CK-Q delay, compared to TGPL. From the perspective of VLSI system
timing, the above-discussed D-Q delay variations are more impactful
than CK-Q delay variations. Indeed, from Tables II and III, CK-Q
variations are smaller than D-Q delay variations. In addition,
critical paths typically go through a D-Q delay, rather than a CK-Q
delay (late computations are finished during the transparency
window). As expected, energy variations were found to be extremely
small (1%), hence the related results are omitted for brevity. From
Tables II and III, CP3L and CSP3L also have 1.7–2.6× less variation
in hold time, which translates into a proportionally lower number of
buffers inserted by place-and-route tools at the timing closure
design phase.
For completeness, the proposed class of pulsed latches was also
compared to other existing topologies that cover a much wider range
of applications, from very high performance to very low energy. In
addition to TGPL, we thus considered STFF for its very high
performance [16], TGFF for its high energy efficiency at moderate
performance [17], and ACFF for its high energy efficiency at low
performance targets [18]. The results of the comparison are
summarized in Table IV, where data are normalized to the best, and
the results from simulations are based on post-layout extraction.
From this table, the proposed class of pulsed latches exhibits the
lowest D-Q delay, as it is 1.5× lower than that of the very fast
STFF (which is confirmed to be slightly faster than TGPL for the
minimum-ED³ design). Also, CP3L and CSP3L largely improve energy
efficiency at high-performance design targets (i.e., minimum ED³),
compared to such high-speed topologies. Indeed, CP3L and CSP3L
reduce ED³ by 2.2× and 2.8× compared to TGPL and STFF, respectively.
CP3L and CSP3L exhibit a significantly better energy efficiency even
compared to topologies that are typically used for moderate- to
low-speed design targets. More specifically, from Table IV, CP3L and
CSP3L designed for minimum ED improve the energy-delay product by
1.4× compared to the energy-efficient TGFF topology, and by 1.9×
compared to the ultralow-energy ACFF.
TABLE IV
AVERAGE AND STANDARD DEVIATION OF MAIN PARAMETERS OF INTEREST
(256 REPLICAS, MIN.-ED³ DESIGN)
Summarizing, the proposed class of pulsed latches outperforms the
state of the art in terms of pure performance, with D-Q delay
improvements in the order of 1.5× or more. In current power-limited
VLSI systems, the more exploitable advantage of CP3L and CSP3L is
their high energy efficiency, as they outperform the state of the
art by more than 2× when compared to topologies targeting high
speed. In addition, the proposed pulsed latches exhibit a better
energy efficiency (1.4–1.9×) even when compared to topologies
targeting very low energy.
VIII. CONCLUSION
In this paper, a new class of pulsed latches has been introduced.
Its push-pull final stage and split paths in the first stage enable
a significant reduction in path and parasitic effort. Measurements
on a 65-nm test chip demonstrated a 1.5–2× speed improvement
compared to TGPL, which makes the proposed topologies the fastest
ever reported. To the best of the authors' knowledge, for the first
time the proposed latches are validated through measurements on 256
replicas. Measurements confirm the above advantages in the presence
of variations.
More importantly, the energy efficiency of the proposed pulsed
latches enables a significant improvement beyond the state of the
art. Indeed, a 2.3× improvement over TGPL was found in terms of ED³
product, and a 1.3× improvement in the ED product. The area penalty
paid by the proposed latches is 1.15–1.35× compared to TGPL, which
is among the smallest existing latches. The proposed pulsed latches
also exhibit a better energy efficiency (1.4–1.9×) compared to
state-of-the-art topologies that target ultralow energy operation.
Finally, the CP3L and CSP3L were shown to be equivalent in terms of
energy and performance, hence both topologies are equally worth
considering when designing highly energy-efficient systems. The
choice between CP3L and CSP3L is driven by preliminary design
decisions on the clocking scheme. Indeed, CP3L does not allow for
sharing a pulse generator, but has lower area than CSP3L if the
pulse generator is included. Hence, CP3L is preferable when only a
small subset of FFs needs to be replaced by a pulsed latch (i.e., in
pipeline stages that have few critical paths, as might occur in
random logic). Indeed, in this case latches tend to be far from each
other, hence it does not make sense to share their pulse generator.
On the other hand, CSP3L is preferable in systems where a
significant number of FFs need to be replaced (e.g., pipeline stages
with many critical paths, as occurs in regular modules).
APPENDIX
In this appendix, transistor sizing for maximum speed (i.e., minimum
D-Q delay) is analytically discussed for CP3L, CSP3L, and TGPL. In
the following, all transistor channel widths are normalized to the
minimum allowed by the technology, the P/N ratio is equal to two,
and capacitances are normalized to the input capacitance of a
minimum symmetric inverter (about 0.3 fF at 1 V in this technology).
Also, transmission gates were sized with equally sized pMOS and nMOS
transistors, keeper transistors are all minimum sized to reduce
dissipation, channel lengths are generally minimum, and series
transistors are equally sized. Different sizing (e.g., non-minimum
channel length) and P/N ratio were allowed in the pulse generator to
adjust the transparency window width, while ensuring pulses with
symmetrical rise/fall time.
A. Logical Effort Optimization of TGPL
The transistor sizes of TGPL to be optimized under the above
assumptions are reported in Fig. 17. From this figure, the two
independent sizes $W_1$ and $W_2$ need to be optimized in the
critical path.
Fig. 17. Transistor sizes in TGPL.
From Fig. 17, the first stage of the critical D-Q path comprises the
input inverter and the subsequent transmission gate. From usual
logical effort calculations, the timing of the first stage is
characterized by the following logical effort parameters:
$$g_1 = \frac{5}{3} \tag{A.1a}$$
$$h_1 = \frac{3W_2 + 4}{3W_1} \tag{A.1b}$$
$$p_1 = \frac{25}{9}. \tag{A.1c}$$
Similarly, the second stage is a simple inverter and hence has
parasitic delay $p_2 = 1$ and logical effort $g_2 = 1$, whereas its
electrical effort is immediately found to be
$$h_2 = \frac{W_L}{3W_2} \tag{A.2}$$
where $W_L$ is the load expressed as an equivalent transistor width
[22] (i.e., the transistor width such that its gate capacitance
equals the load capacitance $C_L$ in Fig. 17), normalized to the
minimum channel width. By setting $g_1 h_1 = g_2 h_2$, and
neglecting the small contribution of the minimum-sized transistors
connected to the output of the first stage (i.e., transistors with
normalized width equal to 1 in Fig. 17), from (A.1) and (A.2) the
optimum $W_2$ that minimizes the D-Q delay is
$W_2 = (W_L W_1/5)^{1/2}$. By substituting $W_2$, (A.1) and (A.2)
lead to the following minimum achievable delay:
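$$D_{\min,\mathrm{TGPL}} = \frac{5}{3}\cdot\frac{3W_2 + 4}{3W_1} + \frac{W_L}{3W_2} + \frac{25}{9} + 1 \approx 2\sqrt{\frac{5}{9}\cdot\frac{W_L}{W_1}} + \frac{34}{9} = 2\sqrt{\frac{5}{3}\cdot\frac{C_L}{C_{in}}} + \frac{34}{9} \tag{A.3}$$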
where we considered that the input capacitance $C_{in}$ in Fig. 17
is equal to the gate capacitance of a transistor with width $3W_1$,
and the load capacitance $C_L$ is by definition the gate capacitance
of a transistor with width $W_L$.
Finally, the detailed pulse generator sizing is very simple and is
herein omitted, as the transistors of the output NAND gate in Fig. 2
must simply be sized to ensure the targeted slope (i.e., rise/fall
time) of signal CP. Commonly adopted values of the clock slope range
from FO3 to FO4 [3], where FOX is the slope of the output waveform
of an inverter loaded by X inverters with the same size.
Subsequently, inverters are easily sized to obtain the targeted
transparency window.
Fig. 18. Transistor sizes in the critical path of CP3L and CSP3L.
B. Logical Effort Optimization of CP3L and CSP3L
The critical D-Q path of CP3L and CSP3L (they both have exactly the
same path) with the related transistor sizes under the above
assumptions is shown in Fig. 18. From this figure, CP3L and CSP3L
have two independent D-Q paths with two stages: the first stage is a
half latch (top latch for the fall path; bottom for the rise path),
and the second is a transistor of the second push-pull stage (nMOS
for the fall path; pMOS for the rise path).
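For the fall path, logical effort analysis leads to
$$g_{1,\mathrm{FALL}} = \frac{4}{3} \tag{A.4a}$$
$$h_{1,\mathrm{FALL}} = \frac{W_2 + W_{PPR} + 1}{2W_1} \tag{A.4b}$$
$$p_{1,\mathrm{FALL}} = \frac{4}{3} \tag{A.4c}$$
whereas the rise path has the following parameters:
$$g_{1,\mathrm{RISE}} = \frac{2}{3} \tag{A.5a}$$
$$h_{1,\mathrm{RISE}} = \frac{2W_2 + W_{NPR} + 1}{W_1} \tag{A.5b}$$
$$p_{1,\mathrm{RISE}} = \frac{2}{3}. \tag{A.5c}$$
For the second stage, analysis for the fall path leads to
$$g_{2,\mathrm{FALL}} = \frac{1}{3} \tag{A.6a}$$
$$h_{2,\mathrm{FALL}} = \frac{4 + W_L}{W_2} \tag{A.6b}$$
$$p_{2,\mathrm{FALL}} = 1 \tag{A.6c}$$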
whereas the rise path has
$$g_{2,\mathrm{RISE}} = \frac{2}{3} \tag{A.7a}$$
$$h_{2,\mathrm{RISE}} = \frac{4 + W_L}{2W_2} \tag{A.7b}$$
$$p_{2,\mathrm{RISE}} = 1. \tag{A.7c}$$
By imposing equal stage efforts, the minimum delay of the fall path
is found to be
$$D_{\min,\mathrm{CP3L,FALL}} = 2\sqrt{\frac{2\,(W_2 + W_{PPR} + 1)}{3W_1}\cdot\frac{4 + W_L}{3W_2}} + \frac{7}{3} \approx 2\sqrt{\frac{2C_L}{3C_{in}}} + \frac{7}{3} \tag{A.8}$$
where we neglected the small capacitance associated with the
minimum-sized transistors in the keeper in Fig. 18 (i.e.,
$W_L \gg 4$) and the capacitance of the small precharge transistors
(i.e., $W_2 \gg W_{PPR} + 1$). Under the same approximations
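$$D_{\min,\mathrm{CP3L,RISE}} = 2\sqrt{\frac{2\,(2W_2 + W_{NPR} + 1)}{3W_1}\cdot\frac{4 + W_L}{3W_2}} + \frac{5}{3} \approx 2\sqrt{\frac{4C_L}{3C_{in}}} + \frac{5}{3}. \tag{A.9}$$
From comparison of (A.8) and (A.9), the worst-case D-Q delay of CP3L
and CSP3L is the rise path delay, given by (A.9).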
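The comparison between the fall and rise paths, and against the TGPL result of (A.3), can be tabulated with a short Python sketch. It evaluates only the approximate closed forms as functions of the ratio $C_L/C_{in}$; the function names and the sample ratios (4, 16, 64) are illustrative assumptions. These are idealized maximum-speed expressions, so the resulting ratios should not be expected to match the measured speedups.

```python
import math


def cp3l_fall_delay(cl_over_cin: float) -> float:
    """Approximate closed form of (A.8): 2*sqrt(2*CL/(3*Cin)) + 7/3."""
    return 2 * math.sqrt(2 * cl_over_cin / 3) + 7 / 3


def cp3l_rise_delay(cl_over_cin: float) -> float:
    """Approximate closed form of (A.9): 2*sqrt(4*CL/(3*Cin)) + 5/3."""
    return 2 * math.sqrt(4 * cl_over_cin / 3) + 5 / 3


def tgpl_delay_closed_form(cl_over_cin: float) -> float:
    """Approximate closed form of (A.3): 2*sqrt(5*CL/(3*Cin)) + 34/9."""
    return 2 * math.sqrt(5 * cl_over_cin / 3) + 34 / 9


# Sample electrical fan-outs CL/Cin (hypothetical values).
for x in (4.0, 16.0, 64.0):
    fall = cp3l_fall_delay(x)
    rise = cp3l_rise_delay(x)
    tgpl = tgpl_delay_closed_form(x)
    worst = max(fall, rise)
    print(f"CL/Cin={x:5.1f}: fall={fall:.2f}, rise={rise:.2f}, "
          f"TGPL={tgpl:.2f}, TGPL/worst-case CP3L={tgpl / worst:.2f}")
```

For these ratios the rise path is the slower of the two CP3L paths, and the TGPL expression stays above it, because TGPL has both the larger effort term and the larger parasitic term.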
ACKNOWLEDGMENT
The authors would like to thank the sponsors of the Berkeley
Wireless Research Center, STMicroelectronics, for chip fabrication,
and Prof. D. Blaauw and D. Sylvester for testing support.
REFERENCES
[1] S. Naffziger and G. Hammond, "The implementation of the
next-generation 64b Itanium microprocessor," in Proc. IEEE ISSCC,
Feb. 2002, pp. 276–504.
[2] B. Dally, "Architectures and circuits for energy-efficient
computing," in Proc. CICC, Sep. 2012, pp. 1–10.
[3] M. Alioto, E. Consoli, and G. Palumbo, "Flip-flop
energy/performance versus clock slope and impact on the clock
network design," IEEE Trans. Circuits Syst., vol. 57, no. 6,
pp. 1273–1286, Jun. 2010.
[4] C. Giacomotto, N. Nedovic, and V. Oklobdzija, "The effect of the
system specification on the optimal selection of clocked storage
elements," IEEE J. Solid-State Circuits, vol. 42, no. 6,
pp. 1392–1404, Jun. 2007.
[5] T. Fischer, S. Arekapudi, E. Busta, C. Dietz, M. Golden,
S. Hilker, A. Horiuchi, K. A. Hurd, D. Johnson, H. McIntyre,
S. Naffziger, J. Vinh, J. White, and K. Wilcox, "Design solutions
for the Bulldozer 32nm SOI 2-core processor module in an 8-core
CPU," in IEEE ISSCC Dig. Tech. Papers, Feb. 2011, pp. 78–80.
[6] P. Gronowski, W. Bowhill, R. Preston, M. Gowan, and R. Allmon,
"High-performance microprocessor design," IEEE J. Solid-State
Circuits, vol. 33, no. 5, pp. 676–686, May 1998.
[7] D. Bailey and B. Benschneider, "Clocking design and analysis for
a 600-MHz Alpha microprocessor," IEEE J. Solid-State Circuits,
vol. 33, no. 11, pp. 1627–1633, Nov. 1998.
[8] S. Naffziger, "High-performance processors in a power-limited
world," in Proc. Symp. VLSI Circuits, Jun. 2006, pp. 93–97.
[9] (2011). International Technology Roadmap for Semiconductors
[Online]. Available: http://www.itrs.net
[10] M. Alioto, E. Consoli, and G. Palumbo, "Analysis and comparison
in the energy-delay-area domain of nanometer CMOS flip-flops:
Part I - Methodology and design strategies," IEEE Trans. Very Large
Scale Integr. (VLSI) Syst., vol. 19, no. 5, pp. 725–736, May 2011.
[11] M. Alioto, E. Consoli, and G. Palumbo, "Analysis and comparison
in the energy-delay-area domain of nanometer CMOS flip-flops:
Part II - Results and figures of merit," IEEE Trans. Very Large
Scale Integr. (VLSI) Syst., vol. 19, no. 5, pp. 737–750, May 2011.
[12] M. Alioto, E. Consoli, and G. Palumbo, "From energy-delay
metrics to constraints on the design of digital circuits," Int. J.
Circuit Theory Appl., vol. 40, no. 8, pp. 815–834, Aug. 2012.
[13] J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev, and
V. De, "Comparative delay and energy of single edge-triggered and
dual edge-triggered pulsed flip-flops for high-performance
microprocessors," in Proc. ISLPED, Aug. 2001, pp. 147–152.
[14] V. Stojanovic and V. Oklobdzija, "Comparative analysis of
master-slave latches and flip-flops for high-performance and
low-power systems," IEEE J. Solid-State Circuits, vol. 34, no. 4,
pp. 536–548, Apr. 1999.
[15] H. Partovi, "Clocked storage elements," in Design of
High-Performance Microprocessor Circuits. Piscataway, NJ, USA: IEEE
Press, pp. 207–234, 2001.
[16] N. Nedovic, V. Oklobdzija, and W. Walker, "A clock skew
absorbing flip-flop," in IEEE ISSCC Dig. Tech. Papers, Feb. 2003,
pp. 342–497.
[17] D. Markovic, B. Nikolic, and R. Brodersen, "Analysis and design
of low-energy flip-flops," in Proc. Int. Symp. Low Power Electron.
Design, Aug. 2001, pp. 52–55.
[18] C. Teh, T. Fujita, H. Hara, and M. Hamada, "A 77% energy-saving
22-transistor single-phase-clocking D-flip-flop with
adaptive-coupling configuration in 40nm CMOS," in IEEE ISSCC Dig.
Tech. Papers, Feb. 2011, pp. 338–340.
[19] E. Consoli, M. Alioto, G. Palumbo, and J. Rabaey, "Conditional
push-pull pulsed latch with 726 fJ·ps energy delay product in 65nm
CMOS," in IEEE ISSCC Dig. Tech. Papers, Feb. 2012, pp. 482–483.
[20] H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa,
K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita,
Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama, "A 1.3GHz
fifth generation SPARC64 microprocessor," in Proc. DAC, Jun. 2003,
pp. 702–705.
[21] M. Wieckowski, Y. M. Park, C. Tokunaga, D. W. Kim, Z. Foo,
D. Sylvester, and D. Blaauw, "Timing yield enhancement through soft
edge flip-flop based design," in Proc. CICC, Sep. 2008,
pp. 543–546.
[22] I. Sutherland, B. Sproull, and D. Harris, Logical Effort:
Designing Fast CMOS Circuits. San Mateo, CA, USA: Morgan Kaufmann
Publishers, 1999.
[23] M. Alioto, E. Consoli, and G. Palumbo, "General strategies to
design nanometer flip-flops in the energy-delay space," IEEE Trans.
Circuits Syst., vol. 57, no. 7, pp. 1583–1596, Jul. 2010.
[24] R. Ho, K. W. Mai, and M. A. Horowitz, "The future of wires,"
Proc. IEEE, vol. 89, no. 4, pp. 490–504, Apr. 2001.
[25] T. Lang, E. Musoll, and J. Cortadella, "Individual flip-flops
with gated clocks for low power datapaths," IEEE Trans. Circuits
Syst. II, Analog Digit. Signal Process., vol. 44, no. 6,
pp. 507–516, Jun. 1997.
[26] S. Heo and K. Asanovic, "Load-sensitive flip-flop
characterization," in Proc. CSW-VLSI, Apr. 2001, pp. 87–92.
[27] S. Heo, R. Krashinsky, and K. Asanovic, "Activity-sensitive
flip-flop and latch selection for reduced energy," IEEE Trans. Very
Large Scale Integr. (VLSI) Syst., vol. 15, no. 9, pp. 1060–1064,
Sep. 2007.
[28] V. Oklobdzija, V. Stojanovic, D. Markovic, and N. Nedovic,
Digital System Clocking: High-Performance and Low-Power Aspects.
New York, NY, USA: Wiley, 2003.
[29] J. D. Warnock, Y. H. Chan, W. V. Huott, S. M. Carey, M. F. Fee,
H. Wen, M. J. Saccamango, F. Malgioglio, P. J. Meaney, D. W. Plass,
Y. Chan, M. D. Mayo, G. Mayer, L. J. Sigal, D. L. Rude, R. Averill,
M. Wood, T. Strach, H. H. Smith, B. W. Curran, E. M. Schwarz,
L. Eisen, D. Malone, S. Weitzel, P. K. Mak, T. J. McPherson, and
C. F. Webb, "A 5.2GHz microprocessor chip for the IBM zEnterprise
system," in Proc. IEEE ISSCC Dig. Tech. Papers, Feb. 2011,
pp. 70–72.
[30] H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa,
K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita,
Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama, "A 1.3GHz
fifth generation SPARC64 microprocessor," in Proc. DAC, Jun. 2003,
pp. 702–705.
[31] N. Nedovic, W. Walker, and V. Oklobdzija, "A test circuit for
measurement of clocked storage element characteristics," IEEE J.
Solid-State Circuits, vol. 39, no. 8, pp. 1294–1304, Aug. 2004.
[32] S. G. Narendra and A. Chandrakasan, Leakage in Nanometer CMOS
Technologies. New York, NY, USA: Springer-Verlag, 2006.
Elio Consoli was born in Catania, Italy, in 1983. He received the
master's degree in microelectronic engineering from the University
of Catania, Catania, in 2008, and the Ph.D. degree from the
Department of Electrical, Electronic and Information Engineering,
University of Catania, in 2012.
He was a Visiting Scholar with the Berkeley Wireless Research
Center, UC Berkeley, Berkeley, CA, USA, in 2010. In 2011, he joined
Maxim Integrated, Catania DC, as a Designer of analog and
mixed-signal ICs for interface, switch and protection products. He
is the co-author of several scientific papers in refereed
international journals and conferences. His current research
interests include clocking strategies and energy-efficient design
for high-performance and low-power digital VLSI systems in nanometer
CMOS technologies, as well as the definition of novel circuits and
design techniques to be employed in ultra-low-power duty-cycled
wireless sensor nodes.
Gaetano Palumbo (F'07) was born in Catania, Italy, in 1964. He
received the Laurea degree in electrical engineering and the Ph.D.
degree from the University of Catania, Catania, in 1988 and 1993,
respectively.
He conducts courses on electronic devices, electronics for digital
systems and basic electronics in 1993. In 1994, he joined the
Dipartimento Elettrico Elettronico e Sistemistico, now Dipartimento
di Ingegneria Elettrica Elettronica e dei Sistemi, with the
University of Catania, as a Researcher, becoming an Associate
Professor in 1998. Since 2000, he has been a Full Professor with the
same department. He is developing some of the research activities in
collaboration with STMicroelectronics of Catania. He was the
co-author of three books, CMOS Current Amplifiers, Feedback
Amplifiers: Theory and Design, and Model and Design of Bipolar and
MOS Current-Mode Logic (CML, ECL and SCL Digital Circuits) (Kluwer
Academic Publishers, 1999, 2001 and 2005), and a textbook on
electronic devices in 2005. He is the author of over 380 scientific
papers in refereed international journals (more than 150) and in
conferences. He is the co-author of several patents. His research
has embraced digital circuits with emphasis on bipolar and MOS
current-mode digital circuits, adiabatic circuits, and
high-performance building blocks focused on achieving optimum speed
within the constraint of low power operation. His current research
interests include analog circuits with particular emphasis on
feedback circuits, compensation techniques, current-mode approach,
and low-voltage circuits.
Dr. Palumbo served as an Associate Editor of the IEEE TRANSACTIONS
ON CIRCUITS AND SYSTEMS PART I for the topics Analog Circuits and
Filters and Digital Circuits and Systems from 1999 to 2001 and from
2004 to 2005. From 2006 to 2007, he served as an Associate Editor of
the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART II. From 2008 to
2011, he served as an Associate Editor of the IEEE TRANSACTIONS ON
CIRCUITS AND SYSTEMS PART I. In 2005, he was one of the 12 panelists
in the scientific-disciplinary area 09 - industrial and information
engineering of the Committee for Evaluation of Italian Research,
which has the aim to evaluate the Italian research from 2001 to
2003. In 2003, he received the Darlington Award. Since 2011, he has
been a member of the Board of Governors of the IEEE CAS Society.
Jan M. Rabaey (M'83–SM'92–F'95) received the Ph.D. degree in applied
sciences from Katholieke Universiteit Leuven, Leuven, Belgium.
He joined the Faculty of the Electrical Engineering and Computer
Science Department, University of California, Berkeley, CA, USA, in
1987, where he holds the Donald O. Pederson Distinguished
Professorship. He is currently the Scientific Co-Director with the
Berkeley Wireless Research Center, Berkeley, CA, USA, and the
Director of the Berkeley Ubiquitous SwarmLab, Berkeley, CA, USA. His
current research interests include the conception and implementation
of next-generation integrated wireless systems.
Prof. Rabaey has received a wide range of major awards. He is a
member of the Royal Flemish Academy of Sciences and Arts of Belgium.
Massimo Alioto (M'01–SM'07) was born in Brescia, Italy, in 1972. He
received the Laurea (M.Sc.) degree in electronics engineering and
the Ph.D. degree in electrical engineering from the University of
Catania, Catania, Italy, in 1997 and 2001, respectively.
He is an Associate Professor with the Department of Electrical and
Computer Engineering, National University of Singapore, Singapore.
He was an Associate Professor with the Department of Information
Engineering, University of Siena, Siena, Italy. In 2013, he was a
Visiting Scientist with Intel Labs CRL, Hillsboro, OR, USA, on
ultra-scalable microarchitectures. From 2011 to 2012, he was a
Visiting Professor with the University of Michigan, Ann Arbor, MI,
USA, investigating active techniques for resiliency in
near-threshold processors, error-aware VLSI design for wide energy
scalability, and self-powered circuits. From 2009 to 2011, he was a
Visiting Professor with BWRC, University of California, Berkeley,
CA, USA, investigating next-generation ultra-low power circuits and
wireless nodes. In 2007, he was a Visiting Professor with EPFL -
Lausanne, Lausanne, Switzerland. He has authored or co-authored over
180 publications in journals (60+, mostly IEEE Transactions) and
conference proceedings. He is the co-author of two books, Flip-Flop
Design in Nanometer CMOS - from High Speed to Low Energy (Springer,
2013) and Model and Design of Bipolar and MOS Current-Mode Logic:
CML, ECL and SCL Digital Circuits (Springer, 2005). His current
research interests include ultra-low power VLSI circuits,
self-powered and wireless nodes, near-threshold circuits for green
computing, error-aware and widely energy-scalable VLSI circuits, and
circuit techniques for emerging technologies.
Prof. Alioto was a member of the HiPEAC Network of Excellence (EU)
and the MuSyC FCRP Center, USA. From 2010 to 2012, he was the Chair
of the VLSI Systems and Applications Technical Committee of the IEEE
Circuits and Systems Society, for which he was a Distinguished
Lecturer from 2009 to 2010 and a member of the DLP Coordinating
Committee from 2011 to 2012. He currently serves as an Associate
Editor-in-Chief of the IEEE TRANSACTIONS ON VLSI SYSTEMS, and served
as a Guest Editor of various journal special issues (including the
issue on Ultra-Low Voltage Circuits and Systems for Green Computing
published in 2012 in the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS
PART II). He serves or has served as an Associate Editor of a number
of journals (IEEE TRANSACTIONS ON VLSI SYSTEMS, ACM Transactions on
Design Automation of Electronic Systems, IEEE TRANSACTIONS ON CAS -
PART I, Microelectronics Journal, Integration, the VLSI Journal,
Journal of Circuits, Systems, and Computers, Journal of Low Power
Electronics, and Journal of Low Power Electronics and Applications).
He was a Technical Program Chair of the ICECS in 2013, NEWCAS in
2012, and ICM in 2010 conferences, and a Track Chair in a number of
conferences (ICCD, ISCAS, ICECS, VLSI-SoC, APCCAS, ICM).