  • IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 7, JULY 2014 1593

    Novel Class of Energy-Efficient Very High-Speed Conditional Push–Pull Pulsed Latches

    Elio Consoli, Gaetano Palumbo, Fellow, IEEE, Jan M. Rabaey, Fellow, IEEE, and Massimo Alioto, Senior Member, IEEE

    Abstract: In this paper, a new class of pulsed latches is introduced and experimentally assessed in 65-nm CMOS. Its conditional push–pull pulsed latch topology is based on a push–pull final stage driven by two split paths with a conditional pulse generator. Two circuit implementations of the concept are discussed, with their main difference being in the pulse generator, which can be either shared (CSP3L) or not (CP3L). Measurements show that the proposed topology is very fast, as it outperforms the well-known transmission gate pulsed latch (TGPL) [1] by 1.5×–2×; hence the proposed pulsed latch has the highest performance ever reported. The proposed pulsed latch is also shown to significantly improve the energy efficiency compared to the state of the art. Indeed, a 2.3× improvement in ED^3 product (energy × delay^3) over TGPL was found for designs targeting minimum ED^3. For designs targeting minimum ED, a 1.3× improvement was found in ED product. This comes at the cost of a 1.15×–1.35× cell area penalty, which translates into an overall area increase well below 1% in typical systems. Measurements on 256 replicas confirm that the above benefits are kept in the presence of variations. Accordingly, the proposed class of pulsed latches goes beyond the current state of the art and is well suited for VLSI systems that require both high performance and energy efficiency.

    Index Terms: Clocking, energy efficiency, energy-delay tradeoff, flip-flops (FFs), high speed, low power, nanometer CMOS, pulsed latches, VLSI.

    Manuscript received February 12, 2013; revised July 11, 2013; accepted July 28, 2013. Date of publication September 9, 2013; date of current version June 23, 2014.

    E. Consoli is with Maxim Integrated Products, Catania 92100, Italy (e-mail: [email protected]).

    G. Palumbo is with the DIEEI, Università di Catania, Catania I-95125, Italy (e-mail: [email protected]).

    J. M. Rabaey is with the Electrical Engineering and Computer Science Department, University of California, Berkeley, CA 94720 USA (e-mail: [email protected]).

    M. Alioto is with the Electronics and Computer Engineering Department, National University of Singapore, 117576 Singapore (e-mail: [email protected]).

    Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TVLSI.2013.2276100

    1063-8210 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    I. INTRODUCTION

    FLIP-FLOPS (FFs) and latches are well known to be responsible for a large fraction of the power budget of microprocessors and VLSI systems [1]–[7]. Typically, they dissipate 80% of the total clock power [5], and 30% of the overall power budget [2]. Energy efficiency of FFs and latches is nowadays even more critical than in the past, considering that speed can be increased only through improvements in energy efficiency, since VLSI systems are power limited [2], [8], [9]. Therefore, the search for novel topologies with a targeted speed under a relatively low consumption (with their tradeoff quantified by composite E^i D^j metrics [10]–[12]) is crucial.

    Fig. 1. Pareto-optimal energy-delay curve of existing FF topologies for a typical load of 16 minimum inverters (energy per cycle and D–Q delay are in arbitrary units).

    Fig. 2. (a) TGPL topology. (b) Pulse generator topology (area in dashed line is shareable among multiple cells).

    Among state-of-the-art topologies, pulsed latches typically exhibit the best energy efficiency from moderate to high performance design targets, among the existing classes of FFs [10]–[15]. In particular, from moderate to very high performance targets, only very few topologies belong to the Pareto-optimal curve of designs having minimum energy for a given performance [10], [11]. As recalled in Fig. 1, the transmission gate pulsed latch (TGPL) [1] (see Fig. 2) used in various Intel microprocessors is the most energy-efficient FF in a rather wide portion of the Pareto-optimal curve, ranging from high-speed (i.e., points with minimum ED^j product with j > 1) to energy-efficient designs (i.e., points with minimum ED). Only the skew-tolerant FF (STFF) is able


    to outperform transmission gate flip-flop (TGFF) for extremely high-speed design targets [16] (i.e., points with minimum ED^j for j ≥ 5). In this region, the STFF speed advantage in terms of D–Q delay is typically about 10%, at the cost of a 2× greater energy [11]. Hence, although STFF is slightly better than TGPL in terms of pure performance, its significantly worse energy efficiency makes it less competitive than TGPL in applications where energy efficiency is a concern. In the following, TGPL is therefore adopted as the reference for high-speed energy-efficient designs. When slower design targets are considered, master-slave FFs exhibit better energy efficiency. The traditional TGFF [17] and the recently proposed Toshiba ACFF [18] are, respectively, the most efficient among designs with balanced energy-delay (i.e., minimum ED) and ultralow-energy designs (i.e., minimum E^j D with j > 1).

    In this paper, a novel class of pulsed latches (conditional push–pull pulsed latch) is introduced. The main idea is to adopt a push–pull output stage, which is driven by two split paths for rise and fall output transitions, with the explicit aim of reducing both the path effort and the parasitic delay [19]. In addition, the capacitance at the output of the first stage is further reduced by adopting half-latches in the split paths and moving the cross-coupled inverters to the output node.

    Two versions are presented, respectively, without (CP3L) and with (CSP3L) shareable conditional pulse generator. Measurements on a 65-nm test chip demonstrate 1.3×–2.3× better energy efficiency compared to TGPL, as well as a 1.5×–2× D–Q delay improvement, even in the presence of process variations. The proposed pulsed latches have a 1.15×–1.35× larger area than TGPL, with a resulting increase in the area of practical VLSI systems that is well below 1%.

    This paper is organized as follows. In Section II, the rationale behind the proposed topologies and their operation is described, and their detailed circuit implementation is discussed in Section III. The potential speed advantage compared to TGPL is analytically evaluated in Section IV, and aspects related to physical design and layout parasitics are discussed in Section V. Details on the 65-nm test chip and the adopted delay-energy testing circuitry are provided in Section VI. Measurement results and comparison with state-of-the-art topologies are discussed in Section VII. Conclusions are reported in Section VIII, and an Appendix presents a detailed logical effort analysis.

    II. CONDITIONAL PUSH–PULL PULSED LATCH: MAIN IDEAS AND OPERATION

    In the proposed class of pulsed latches shown in Fig. 3, a push–pull output stage is adopted (M7–M8) as opposed to the traditional output inverter stage employed in most existing topologies [10]–[18] (see M5–M6 in TGPL in Fig. 2). Such a technique allows for reducing the load of the driving circuitry by a factor of 2–3, thereby making it faster and more energy-efficient. This also allows M7–M8 in Fig. 3 to be up-sized, and hence to have a faster output stage.

    The push–pull output stage in Fig. 3 is driven by two split paths that generate the active-high R (active-low set S) pulsed signal, which resets (sets) the output when active. Pulses R and S are alternately generated to enable a fall/rise output transition, respectively. These pulses are generated at the falling clock edge by the conditional pulse generator in Fig. 3, and are transferred to the output stage by either the half latch M1–M3 or M4–M6, depending on whether input D is, respectively, low or high (see below for a detailed description of the pulse waveforms). These half latches in the first stage within the D–Q critical path have fewer parasitics than typical clocked inverters or inverters with cascaded transmission gate [10]–[18] (see M1–M4 in Fig. 2). The input D drives two different paths, respectively, through an nMOS (M5) and a pMOS (M2) transistor in Fig. 3, which is equivalent to the load of a traditional input inverter stage (see M1–M2 in TGPL in Fig. 2).

    Fig. 3. General scheme of the proposed class of pulsed latches.

    The operation of the scheme in Fig. 3 is explained in detail in Fig. 4, which depicts the main waveforms of the internal signals. After the falling clock edge (cycle 1 in Fig. 4), the pulse generator checks if the previous output QD in Fig. 3 is high or low (see footnote 1). If the previous output is QD = 1, the next output Q can stay at the same value or make a falling transition, hence a pulse is generated in the fall path in Fig. 3 through the active-low signal CPf, whereas nothing changes in the rise path (active-high signal CPr is kept low, thus latch M4–M6 keeps S high and maintains M8 OFF). Subsequently, if the input stays at the previous value D = 1, the latch M1–M3 is not enabled; hence R is dynamically kept at the previous value R = 0 (then, it is statically tied to ground once the pulse expires). On the other hand, if the input changes to D = 0, the latch M1–M3 is enabled and the CPf pulse determines a high pulse in R, which turns M7 ON and brings the output Q low. Afterwards, its delayed output replica QD experiences the same transition.

    If the previous output is QD = 0, right after the falling clock edge (cycle 2 in Fig. 4), a pulse is generated in the rise path through the active-high signal CPr (nothing changes in the fall path). If the input stays at the previous value D = 0, the latch M4–M6 is disabled and S is kept high, so that nothing changes in the rise path. If the input changes to D = 1, the latch M4–M6 is enabled and the CPr pulse pulls down S, thereby turning M8 ON and bringing Q high. Afterwards, the delayed output replica QD experiences the same transition.

    Footnote 1: More precisely, the delayed version QD of the output Q is fed back to the conditional pulse generator. As explained below, feeding back QD (rather than Q) reduces the internal activity and hence the energy per cycle.

    Fig. 4. Waveforms of internal signals of the general scheme in Fig. 3.

    Fig. 5. CP3L topology (area in dashed line is shareable among multiple cells).

    At the steady state, R (S) in Fig. 3 is set to 0 (1), thereby turning OFF the output transistors M7–M8, with the output being maintained at the desired value by a keeper. In other words, the memory element within the proposed topology in Fig. 3 is actually placed at the output node, as opposed to most of the existing topologies where it is placed before the output stage (see the gated cross-coupled inverter pair in Fig. 2, which is connected to the input of the output stage M5–M6). This moves the parasitics associated with the memory element to the output node, thereby making the input node of the output stage lightly loaded, and hence faster and more energy efficient.
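The cycle-by-cycle behavior described in this section can be summarized with a small behavioral sketch. It is only a logic-level illustration under simplifying assumptions (ideal pulses, the delayed replica QD equal to the previous output at the clock edge, no timing or electrical effects); the names `CycleResult` and `latch_cycle` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class CycleResult:
    cpf_pulsed: bool  # True if the active-low CPf pulse is generated (fall path)
    cpr_pulsed: bool  # True if the active-high CPr pulse is generated (rise path)
    r: int            # R during the transparency window (1 = reset asserted)
    s: int            # S during the transparency window (0 = set asserted)
    q: int            # output Q after the transparency window


def latch_cycle(q_prev: int, d: int) -> CycleResult:
    """Logic-level model of one clock cycle of the conditional push-pull latch."""
    qd = q_prev  # assumption: QD still holds the previous output at the clock edge

    # Conditional pulse generation: only the path that could change Q is pulsed.
    cpf_pulsed = qd == 1
    cpr_pulsed = qd == 0

    # Half latches: a pulse reaches the output stage only if D asks for a change.
    r = 1 if (cpf_pulsed and d == 0) else 0   # fall path, turns M7 ON
    s = 0 if (cpr_pulsed and d == 1) else 1   # rise path, turns M8 ON

    if r == 1:
        q = 0          # pull-down M7 discharges the output
    elif s == 0:
        q = 1          # pull-up M8 charges the output
    else:
        q = q_prev     # M7 and M8 are OFF, the keeper holds the value
    return CycleResult(cpf_pulsed, cpr_pulsed, r, s, q)


if __name__ == "__main__":
    q = 1
    for d in (1, 0, 0, 1):
        result = latch_cycle(q, d)
        print(f"D={d} Q_prev={q} -> R={result.r} S={result.s} Q={result.q}")
        q = result.q
```

In this model the output always ends the cycle equal to D, and R and S are never asserted in the same cycle, which mirrors the alternate generation of the two pulses described above.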

    III. IMPLEMENTATION OF THE CONDITIONAL PUSH–PULL PULSED LATCH CONCEPT: CP3L AND CSP3L TOPOLOGIES

    As discussed above, the proposed class of pulsed latches in Fig. 3 tends to have a lightly loaded D–Q critical path, thereby making it potentially fast and energy-efficient. Such features can be implemented in different ways. In the following, we present two versions, respectively, without (Section III-A) and with (Section III-B) shareable pulse generator.

    A. CP3L: Conditional Push–Pull Pulsed Latch

    The schematic of the CP3L topology is depicted in Fig. 5. The keeper (M9–M12 in Fig. 5) drives the output Q and comprises a cross-coupled inverter pair, whose forward inverter is gated to avoid current contention with the output stage M7–M8. Indeed, if R = 1 the pull-down M7 of the output stage is ON and the pull-up network of the keeper is OFF through M11. Analogously, if S = 0 the pull-up M8 of the output stage is ON and the pull-down network of the keeper is OFF through M10.

    As an additional advantage brought by placing the keeper after the output stage rather than before, CP3L has a lighter load on its critical path since the half latch M1–M3 (M4–M6) in the first stage has to drive the single transistor M11 (M10). Also, since the two pulses R and S are alternately generated, only one of M10 and M11 in the keeper is actually subject to transitions of the gate terminal in a given cycle. In contrast, the first stage of traditional topologies must drive two transistors associated with the keeper, and both of them are subject to transitions [10]–[18] (see transistors M11–M12 in Fig. 2, which load transistors M3–M4 lying in the critical path). This clearly reduces the parasitic load of the first stage of CP3L and reduces activity at the keeper capacitances, thereby making the first stage faster and potentially more energy efficient.

    Regarding the pulse generator, it comprises a clock phase generator, a pseudo-NAND gate for the fall path (M15–M19 in Fig. 5), and a pseudo-NOR gate for the rise path (M20–M24). Operation is summarized in Fig. 6, which depicts the waveforms of the signals involved in the generation of the CPf and CPr pulses. Generally, the pseudo-NAND (pseudo-NOR) gate sets signal CPf (CPr) high (low), since signals CK(I)N and CK(IV) (CK and CK(III)N) are complementary and thus keep either transistor M18 or M19 (M20 or M21) ON. However, after the falling clock edge, signals CK(I)N and CK(IV) (CK and CK(III)N) are both temporarily high (low) due to the transitions of the four inverters within the clock phase generator (in the example in Fig. 6, each inverter is assumed to have the same delay τinv for simplicity). Accordingly, during the time slot τinv–4τinv in Fig. 6, the pseudo-NAND temporarily sets CPf low through transistors M15–M17 if QD = 1 (otherwise, CPf remains high). Similarly, during the time slot 0–3τinv in Fig. 6, the pseudo-NOR temporarily sets CPr high through transistors M22–M24 if QD = 0 (otherwise, CPr remains low). Hence, the clock phase generator and the pseudo-NAND/NOR gates implement a conditional pulse generator, which alternately produces a pulse on either CPf or CPr, as determined by the previous output value QD. The clock phase generator can be shared among multiple latches to amortize its overhead.

    Fig. 6. Clock phase generator and waveforms defining CPr and CPf pulses.
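The pulse timing described above can be illustrated with a short sketch. It assumes, as in the simplified example of Fig. 6, that the clock falls at t = 0 and that each of the four inverters adds the same delay; gate delays of the pseudo-NAND/NOR are ignored. The function name `pulse_windows` and the dictionary layout are hypothetical.

```python
from typing import Dict, Optional, Tuple

Window = Tuple[float, float]  # (start, end) of a pulse, same time unit as tau_inv


def pulse_windows(tau_inv: float, qd: int) -> Dict[str, Optional[Window]]:
    """Return the ideal CPf/CPr pulse slots for a given delayed output QD.

    Assumption: CK falls at t = 0 and the k-th inverter output of the clock
    phase generator switches at k * tau_inv (k = 1..4).
    """
    # t[0]: CK falls, t[1]: CK(I)N rises, t[2]: second inverter output falls,
    # t[3]: CK(III)N rises, t[4]: CK(IV) falls.
    t = [k * tau_inv for k in range(5)]

    cpr_slot = (t[0], t[3])  # CK and CK(III)N are both low
    cpf_slot = (t[1], t[4])  # CK(I)N and CK(IV) are both high

    return {
        "CPf": cpf_slot if qd == 1 else None,  # active-low pulse on the fall path
        "CPr": cpr_slot if qd == 0 else None,  # active-high pulse on the rise path
    }


if __name__ == "__main__":
    print(pulse_windows(10.0, qd=1))
    print(pulse_windows(10.0, qd=0))
```

Under these assumptions both slots are three inverter delays wide, so changing the inverter delay scales the transparency window directly.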

    It is useful to observe that the width of the CPf and CPr pulses determines the width of the transparency window of the CP3L latch in which the input can affect the output. From a design point of view, the width of the transparency window can be modified by changing the delay of the inverters within the clock phase generator in Fig. 5. The effect of process variations on timing can be compensated through post-silicon tuning of the pulse width, possibly sharing the tuning circuitry among multiple latches [1], [20], [21]. In this paper, no tunability is added to the considered pulsed latches since the addition of such a feature would impact the area/energy of any pulsed latch equally. Indeed, almost all existing pulsed latches adopt the same pulse generator topology (e.g., cascaded inverters as in Figs. 2, 5, and 6) [10], [11].

    The delay stage in the feedback path in Figs. 3–5 generates a delayed replica QD of the output Q, and is implemented by the two inverters M13–M14 and M25–M26 in Fig. 5. Actually, only the slow transistors M25–M26 are added to implement such a delay, as the inverter M13–M14 is already available (i.e., M13–M14 are used to both latch and delay the output). This delay stage makes sure that QD is kept stable at its previous value during the transparency window, thereby preventing glitches in CPr and CPf and reducing dynamic energy, as discussed in the following.

    Without the delay stage, the output Q would be connected directly to the pseudo-NAND/NOR in Fig. 5, hence any output transition within the transparency window would immediately trigger the generation of an additional (undesired) pulse. As shown in detail in Fig. 7, which refers to the case where Q is directly connected to the pseudo-NAND/NOR, a falling transition of Q following the same input transition immediately triggers a high pulse in CPr, as the pseudo-NOR in Fig. 5 temporarily has all pMOS transistors M22–M24 ON during the transparency window (i.e., the CPr time slot in Fig. 6). Observe that this glitch in the CPr pulse increases the dynamic energy, but it does not affect correct operation. Indeed, if the previous output was Q = 1 and the current input is D = 0 as in Fig. 7, the CPr glitch cannot propagate through the half latch M4–M6 since M5 is OFF. On the other hand, if the previous output was Q = 1 and the current input is D = 1, the CPr glitch propagates through the half latch M4–M6 and temporarily sets S = 0, but it does not affect the output anyway since the latter is kept at the desired value Q = 1 through M8. Dual considerations hold for glitches in CPf when no delay stage is inserted. As a result, the delay stage in Figs. 3–5 is not strictly necessary, but its insertion reduces the activity in CPr and CPf and hence energy.

    Fig. 7. Glitch in CPr occurring if no delay stage is inserted in the feedback path in Figs. 3–5 (waveforms of D, CK, CPr, CPf, R, S, and Q during cycle 1).

    B. CSP3L: Conditional Shareable Push–Pull Pulsed Latch

    In CP3L, the pulse generator cannot be shared among multiple latches since the pseudo-NOR/NAND gates are driven by QD, which is different for each latch. In this subsection, we present a different implementation of the same concept by integrating the conditional logic in the latch so that the whole pulse generator can be shared. The resulting conditional shareable push–pull pulsed latch (CSP3L) topology is depicted in Fig. 8.

    In CSP3L, static NAND/NOR gates are introduced in the shareable pulse generator to generate the pulses CPf,ext and CPr,ext that are distributed to multiple latches and have the same role as CPf and CPr had in CP3L. In each latch, such external pulses are enabled through the switches M15–M22 in Fig. 8, which implement the conditional pulse selection logic. The latter comprises two transmission gates and two small keepers to maintain the same operation as before. As discussed above, the delay stage M23–M26 is introduced in the feedback path (two more transistors than in CP3L, since the transmission gates need complementary control signals). The resulting transistor count is the same as CP3L, hence the CSP3L area is expected to be roughly the same as CP3L (excluding the shareable part).

    Since CSP3L is based on the same concept as CP3L, operation is very similar. The main difference is in the conditional pulse selection logic, which enables the propagation of either CPf,ext or CPr,ext to the half latches, according to the value of the delayed output replica QD. In particular, if QD = 1 (QD = 0) the fall (rise) path is activated, as the transmission gate M15–M16 (M19–M20) transfers the CPf,ext (CPr,ext) pulse to the input of the half latch M1–M3 (M4–M6), similar to the pseudo-NAND (pseudo-NOR) of CP3L in Fig. 5. As a minor difference from CP3L, the input capacitance seen from CPf,ext and CPr,ext in CSP3L depends on Q, which may lead to data-dependent clock skew (see Fig. 8). In practical cases, this is not a concern considering that pulsed latches inherently tolerate a significant amount of skew.

    Fig. 8. CSP3L topology (area in dashed line is shareable among multiple cells). Labeled blocks: half latches, conditional pulse selection, output stage, delay, and shareable pulse generator.

    IV. ANALYSIS OF SPEED POTENTIAL

    In this section, CP3L and CSP3L are compared with TGPL in terms of maximum achievable performance through logical effort analysis [22]. According to the analysis under the assumptions in the Appendix, the minimum D–Q delay normalized to the technology-dependent constant τ [22] for the CP3L, CSP3L, and TGPL topology is

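    $$D_{\min,\mathrm{CP3L}} \approx D_{\min,\mathrm{CSP3L}} \approx \frac{4}{\sqrt{3}}\sqrt{\frac{C_L}{C_{in}}} + \frac{5}{3} \qquad (1)$$

    $$D_{\min,\mathrm{TGPL}} \approx \frac{5}{\sqrt{3}}\sqrt{\frac{C_L}{C_{in}}} + \frac{34}{9} \qquad (2)$$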

    where CL and Cin are, respectively, the load and the input capacitance of the pulsed latch. From (1)–(2), CP3L and CSP3L have basically the same minimum D–Q delay, as is expected by considering that they have the same D–Q critical path (M1–M8 in Figs. 5 and 8).

    From (1)–(2), CP3L and CSP3L are always faster than TGPL. Their theoretical maximum speed advantage is about 2.3× and is obtained at light loads (i.e., electrical effort CL/Cin ≪ 1). For typical electrical efforts ranging from 10 to 30, the potential speed advantage is 1.4×–1.5×, and decreases to 1.3× for 60 or more. Although this analysis does not account for wire parasitics, which will be included in the next section, it suggests that the potential advantage of CP3L and CSP3L over TGPL typically ranges from 1.4× to 2×.
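As a rough numerical check of the trend quoted above, the sketch below evaluates the speed advantage as a function of the electrical effort. It rests on an assumption that should be stated loudly: the delay expressions are read here in the usual two-stage logical effort form, with an effort term proportional to the square root of CL/Cin (coefficients 4/√3 and 5/√3) plus parasitic terms 5/3 and 34/9. The extracted text of (1)–(2) is damaged, so this reading is a reconstruction and the function names are hypothetical.

```python
import math


def dmin_cp3l(h: float) -> float:
    """Assumed reading of (1): normalized minimum D-Q delay of CP3L/CSP3L."""
    return 4.0 / math.sqrt(3.0) * math.sqrt(h) + 5.0 / 3.0


def dmin_tgpl(h: float) -> float:
    """Assumed reading of (2): normalized minimum D-Q delay of TGPL."""
    return 5.0 / math.sqrt(3.0) * math.sqrt(h) + 34.0 / 9.0


def speed_advantage(h: float) -> float:
    """Ratio of TGPL delay to CP3L delay at electrical effort h = CL/Cin."""
    return dmin_tgpl(h) / dmin_cp3l(h)


if __name__ == "__main__":
    for h in (0.0, 1.0, 10.0, 30.0, 60.0):
        print(f"CL/Cin = {h:5.1f}  speed advantage = {speed_advantage(h):.2f}x")
```

With this reading, the ratio tends to (34/9)/(5/3), about 2.3, as the load vanishes, stays near 1.4 for electrical efforts of 10 to 30, and approaches 1.3 at 60, which is broadly in line with the figures quoted in the text.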

    The above speed improvement is justified by the lighter load of the stages lying in the critical path, as was discussed in detail in the previous section. Logical effort analysis in the Appendix quantifies the advantages of CP3L and CSP3L in each critical path stage. Comparison of (A1)–(A5) and (A2)–(A7) clearly shows that CP3L and CSP3L have a speed advantage over TGPL both in the first and second stage. In particular, compared to TGPL, the first stage has 1.25× lower logical effort and 2× lower parasitic delay thanks to the lighter loading effect of parasitics. In addition, the second stage has 1.5× lower logical effort thanks to the push–pull configuration.

    The potential performance improvement enabled by CP3L and CSP3L is kept also in the presence of layout parasitics, as will be discussed in the next section, and can be traded off for significantly lower energy at iso-performance, as will be demonstrated in Section VII.

    V. LAYOUT-AWARE SIZING METHODOLOGY AND PHYSICAL-LEVEL CONSIDERATIONS

    Although the analysis in the previous section shows a clear advantage of CP3L and CSP3L in terms of maximum performance, in practical cases transistors are optimized to have a reasonable balance with energy. To explore the energy-delay tradeoff in meaningful design cases, the pulsed latches were sized to minimize the energy-delay ED^j figure of merit [10], [11]. The resulting designs are energy optimal, in the sense that they belong to the Pareto-optimal energy-delay curve [23]. In the following, we focus on the design points with minimum ED and ED^3, which are, respectively, representative of applications targeting balanced energy-delay and high speed [10], [11].
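To make the role of the exponent j concrete, the sketch below picks the design point that minimizes ED^j among a few candidate sizings. The candidate names and their energy/delay values are invented purely for illustration (arbitrary units) and are not measurements from the paper; `DesignPoint`, `figure_of_merit`, and `best_design` are hypothetical names.

```python
from typing import List, NamedTuple


class DesignPoint(NamedTuple):
    name: str
    energy: float  # energy per cycle, arbitrary units (illustrative value)
    delay: float   # D-Q delay, arbitrary units (illustrative value)


def figure_of_merit(p: DesignPoint, j: int) -> float:
    """Composite energy-delay metric E * D^j."""
    return p.energy * p.delay ** j


def best_design(points: List[DesignPoint], j: int) -> DesignPoint:
    """Design point with minimum E * D^j among the candidates."""
    return min(points, key=lambda p: figure_of_merit(p, j))


# Hypothetical sizings: larger transistors cost energy but reduce delay.
CANDIDATES = [
    DesignPoint("small", energy=1.0, delay=2.0),
    DesignPoint("medium", energy=1.5, delay=1.5),
    DesignPoint("large", energy=2.5, delay=1.2),
]

if __name__ == "__main__":
    print("min ED   :", best_design(CANDIDATES, j=1).name)
    print("min ED^3 :", best_design(CANDIDATES, j=3).name)
```

With these made-up numbers, the smallest sizing minimizes ED while the largest one minimizes ED^3, which illustrates why a higher j is taken as representative of high-speed targets.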

    Layout parasitics are well known to be comparable to (or dominate) device parasitics even in internal nodes of standard cells, and hence they have a considerable impact on the optimal transistor sizes that minimize energy for a given performance constraint [24]. However, most of the previous work on FF/latch sizing and comparison presents sub-optimal designs in terms of energy efficiency, as they include layout parasitics only after transistors are actually sized [13], [17], [25]–[28]. In contrast, in this paper, layout parasitics are explicitly included in the circuit design loop by resorting to the layout-aware sizing methodology proposed by the same authors in [23]. In short, a preliminary layout organization was set up in the form of a stick diagram, then layout parasitics were estimated from the stick diagram as a function of transistor sizes. Subsequently, transistor sizes were optimized for the targeted energy-delay figure of merit by including the estimated layout parasitics in the optimization.

    Fig. 9(a)–(c) shows the layout of TGPL, CP3L, and CSP3L for a minimum energy-delay target and a load of 16 minimum-sized inverters. In these figures, the dashed line defines the portion that can be shared and amortized among multiple latches (i.e., the pulse generator). These figures show that the shareable portion occupies a significant fraction of the overall latch area. According to the data in Table I, TGPL is confirmed to have a very low area, as is well known from the comparison with other existing topologies [10], [11]. In particular, TGPL offers a 1.15× (1.28×) area reduction over CP3L (CSP3L) when the pulse generator is not shared. When the latter is shared, TGPL has a 1.33× (1.35×) area reduction over CP3L (CSP3L), which is similar to the advantage that TGPL has compared to most of the area-efficient existing topologies [10], [11]. As a numerical example, in typical microprocessor systems, latches typically occupy less than 5% of the total area (see footnote 2), hence the adoption of CP3L (CSP3L) as a replacement of all FFs would lead to a 1.6% (1.7%) area increase compared to TGPL. In practical cases, the area increase resulting from the adoption of CP3L or CSP3L is well below 1%, since more traditional topologies are typically used in noncritical paths [28].

    Fig. 9. Layout under sizing for minimum ED. (a) TGPL. (b) CP3L. (c) CSP3L. (Area in dashed line is shareable among multiple cells.)

    TABLE I
    AREA COMPARISON (65 nm, STD CELL HEIGHT: 3.9 µm)

    Footnote 2: This estimate is based on the consideration that the cache memory takes up more than 50% of the area of today's processors [29], and flip-flops/latches typically account for 1% of the total gate count of a core [30], with their size being in the same order of magnitude as other cells.
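The system-level area figure in the numerical example above follows from a one-line estimate: the fraction of chip area occupied by latches multiplied by the relative growth of each cell. The sketch below reproduces it, assuming the 5% upper bound on latch area and the shared-pulse-generator cell ratios of 1.33× and 1.35× quoted in the text; the function name is hypothetical.

```python
def system_area_increase(latch_area_fraction: float, cell_area_ratio: float) -> float:
    """Fractional chip-area increase when every latch grows by cell_area_ratio.

    Assumes all latches are replaced and the rest of the chip is unchanged.
    """
    return latch_area_fraction * (cell_area_ratio - 1.0)


if __name__ == "__main__":
    for name, ratio in (("CP3L", 1.33), ("CSP3L", 1.35)):
        increase = system_area_increase(0.05, ratio)
        print(f"{name}: {100.0 * increase:.2f}% area increase")
```

This gives about 1.65% for CP3L and 1.75% for CSP3L, consistent with the roughly 1.6% (1.7%) upper bound quoted above for full replacement; replacing only the latches on critical paths scales the result down proportionally.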

    Fig. 10. Architecture of the test chip.

    Finally, post-layout parasitic extraction on different design points showed that the intracell wire parasitics at the output of the first and second stage of CP3L and CSP3L are very similar (within a few percent) to those of TGPL for a given energy-delay target. In other words, the qualitative energy-performance advantages expected from Section III and the quantitative performance benefits estimated in Section IV are expected to hold in practical cases where layout parasitics are included (details will be provided in Section VII).

    VI. ON-CHIP TESTING HARNESS

    A test chip in 65-nm CMOS was prototyped to validate the proposed class of pulsed latches. The test chip contains CP3L, CSP3L, and TGPL latches, each implemented in two versions targeting minimum-ED and minimum-ED³ design, with a 16× and a 64× load, respectively (1× is equivalent to the input capacitance of a minimum-sized symmetric inverter). Each latch version was implemented in 64 replicas. To the best of our knowledge, this is the first paper that validates novel latch topologies through measurements of multiple replicas.

    The architecture of the test chip is shown in Fig. 10. A scan chain (SC) scans in the test settings and scans out measurement data. A test controller (TC) manages and applies settings throughout the latch arrays and steers measurement data. For each measurement, TC triggers the generation of a pulse with appropriate width through the pulse generator, and the delay generator (DG) generates delayed versions of test signals, whose delay can be tuned with a 1.8-ps step (its use is discussed below). Each latch (test unit in Fig. 10) is excited after a preliminary coarse selection to reduce the number of cells switching at the same time, and then a final selection steers measurements to SC. The testing harness measures timing parameters and energy of the above latches. Two different arrays are used to measure energy and delay, so that the reconfigurability (i.e., parasitics) added to better measure one of the two parameters does not interfere with the other. The die photo is in Fig. 11.

    TABLE II
    AVERAGE AND STANDARD DEVIATION OF MAIN PARAMETERS OF INTEREST (256 REPLICAS, MIN.-ED DESIGN)

    TABLE III
    COMPARISON WITH STATE OF THE ART

    Fig. 11. Die photo.

    Fig. 12. Block diagram of test unit for timing characterization.

    The energy is measured through an external 1-V pin supplying power to nine replicas. The activity factor can be tuned among the following values, which widely cover practical applications: 6.25%, 12.5%, 25%, 50%, and 100%. To separate dynamic and leakage energy components, leakage of a bank of latches is measured through a separate pin. Regarding timing parameters, the testing harness proposed in [31] has been implemented to measure the D–CK, CK–Q, and D–Q delay of each latch replica. Each delay is measured as the difference of the arrival times of the involved signals (e.g., D and CK for D–CK delay). The test unit for timing characterization of a single latch is schematized in Fig. 12, which shows that D, CK, and Q of the latch under test are MUXed and captured by a master-slave (transmission-gate) flip-flop clocked by clock CKMS. The arrival time of each signal relative to the CKMS edge is measured by sweeping the tunable delay between it and CKMS, and checking when the capturing FF starts failing [31] (beyond that point, data is incorrectly captured for greater delays). For example, D–CK delay can be measured according to the following steps [31].

    1) Select CK through the MUX and sweep the tunable delay between CK and CKMS. Beyond a certain delay, CK is captured incorrectly. The delay that was applied immediately before defines the arrival time of CK relative to CKMS.

    2) Repeat the same steps by sweeping the delay between D and CKMS. This defines the arrival time of D relative to CKMS.

    3) Evaluate the D–CK delay as the difference between the two arrival times.
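The three steps above can be sketched in software as follows. This is only an illustrative model: `captured_ok` is a hypothetical stand-in for one on-chip capture attempt, the toy pass/fail model and the sign convention of the difference are assumptions, and only the 1.8-ps step comes from the text.

```python
STEP_PS = 1.8  # tuning step of the delay generator, as quoted in the text


def arrival_time(captured_ok, max_steps=500, step_ps=STEP_PS):
    """Return the last swept delay (ps) at which the signal is still captured correctly.

    captured_ok(delay_ps) -> bool is a hypothetical stand-in for one on-chip
    capture attempt by the FF clocked by CKMS.
    """
    last_good = None
    for k in range(max_steps):
        delay_ps = k * step_ps
        if not captured_ok(delay_ps):
            break  # first failing delay: stop the sweep
        last_good = delay_ps
    if last_good is None:
        raise ValueError("signal was never captured correctly")
    return last_good


def toy_signal(margin_ps):
    """Toy model (assumption): capture succeeds while the delay stays within margin_ps."""
    return lambda delay_ps: delay_ps <= margin_ps


t_ck = arrival_time(toy_signal(100.0))  # step 1: CK relative to CKMS
t_d = arrival_time(toy_signal(60.0))    # step 2: D relative to CKMS
d_ck_delay = t_ck - t_d                 # step 3: difference of the two arrival times
print(f"D-CK delay = {d_ck_delay:.1f} ps (resolution {STEP_PS} ps)")
```

In this model each arrival time is quantized to the sweep step, so the difference of two arrival times is accurate to within one step.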

    Unlike [31], we characterized an array of replicas by using a single DG to test all replicas, to enable full timing characterization of the pulse generator. To keep layout parasitics small and have good control of the latch load, each replica has its own local measurement unit (see Fig. 12), and the replicas are placed next to each other.


    Fig. 13. Setup characteristic: CK-Q and D-Q delay versus CK-D of latches designed for (a) minimum ED (16× load) and (b) minimum ED³ (64× load).

    VII. MEASUREMENT RESULTS

    The above-described testing harness was used to characterize the CP3L, CSP3L, and TGPL latches.

    A. Performance, Hold Time Robustness, and Energy

    The measured setup curves (i.e., CK–Q and D–Q delay versus CK–D) are reported in Fig. 13(a) and (b) for the designs targeting minimum ED and minimum ED³, respectively. The measured replica is representative of the nominal corner in this process. From Fig. 13(a) and (b), CP3L and CSP3L have very similar minimum D–Q delay, as expected. The D–Q delay of CP3L (CSP3L) is 17.3 ps (17.9 ps) for minimum-ED sizing, while it is 15.6 ps (16.1 ps) for minimum-ED³ sizing. From the same figures, the TGPL latch under the same conditions achieves 34.6 and 24 ps, respectively. Accordingly, TGPL is slower than CP3L (CSP3L) by 2.03× (1.92×) for the minimum-ED design, and by 1.54× (1.47×) for the minimum-ED³ design. This is particularly interesting, considering that TGPL is well known for being the fastest existing topology among those with reasonably high energy efficiency [11] (and only 10% slower than the very fastest).

    The hold time of CP3L (CSP3L) is 90.5 ps (99.3 ps) for the minimum-ED design, and 123.6 ps (130.1 ps) for the minimum-ED³ design. On the other hand, TGPL has a hold time of 121.9 ps and 171.1 ps for the minimum-ED and minimum-ED³ design, respectively. Hence, CP3L and CSP3L have a hold time that is slightly better than that of TGPL (by about 1.3×) for both the minimum-ED and minimum-ED³ designs.

    Fig. 14. Energy per cycle versus data activity.

    The transient energy per cycle ETRAN (i.e., dynamic and short-circuit) is plotted in Fig. 14 versus data activity. This figure shows that the energy of CP3L and CSP3L is from 40% to 60% higher than that of TGPL, depending on the specific activity. Energy by itself is clearly not representative of energy efficiency, as it should be evaluated at iso-performance. The energy-delay tradeoff of the above topologies for the different design targets is depicted in Fig. 15(a), which shows that the minimum-ED CP3L and CSP3L (which are once again very close to each other) are even faster and consume less energy than the minimum-ED³ TGPL. More quantitatively, the energy of CP3L, CSP3L, and TGPL for 25% data activity is, respectively, 42, 41.5, and 26.1 fJ for the minimum-ED design, hence CP3L and CSP3L exhibit a 1.3× better energy-delay product compared to TGPL. For the minimum-ED³ design, the energy of CP3L, CSP3L, and TGPL is 73.7, 75.7, and 46.1 fJ, hence CP3L and CSP3L improve ED³ by 2.3× compared to TGPL. From Fig. 14, similar or better energy efficiency is expected at other realistic values of data activity. The energy improvement enabled by CP3L and CSP3L is intuitively explained by considering that these topologies are significantly faster than TGPL (see Section IV). Hence, CP3L and CSP3L tend to have smaller transistor sizes for a given performance target, which in turn translates into smaller dynamic and leakage energy compared to TGPL.
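The composite metrics used above can be recomputed from the figures quoted in the text. The sketch below assumes that the relevant delay is the minimum D–Q delay reported for Fig. 13; since the inputs are rounded values for a single replica, the resulting ratios only approximately reproduce the reported 1.3× and 2.3× figures. Function and variable names are illustrative.

```python
# (energy in fJ at 25% data activity, minimum D-Q delay in ps), as quoted in the text
MIN_ED = {"CP3L": (42.0, 17.3), "CSP3L": (41.5, 17.9), "TGPL": (26.1, 34.6)}
MIN_ED3 = {"CP3L": (73.7, 15.6), "CSP3L": (75.7, 16.1), "TGPL": (46.1, 24.0)}


def energy_delay_metric(energy, delay, j=1):
    """Composite E * D^j metric (j = 1 for ED, j = 3 for ED^3)."""
    return energy * delay ** j


def improvement_over_tgpl(table, j):
    """Ratio of the TGPL metric to the metric of each other topology."""
    reference = energy_delay_metric(*table["TGPL"], j)
    return {
        name: reference / energy_delay_metric(energy, delay, j)
        for name, (energy, delay) in table.items()
        if name != "TGPL"
    }


ed_gain = improvement_over_tgpl(MIN_ED, j=1)
ed3_gain = improvement_over_tgpl(MIN_ED3, j=3)
for name in ("CP3L", "CSP3L"):
    print(f"{name}: ED gain {ed_gain[name]:.2f}x, ED^3 gain {ed3_gain[name]:.2f}x")
```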

    Leakage can also be a concern in FFs and latches, for example, in VLSI systems operating in standby mode while retaining information in registers and power gating all other gates [32]. The leakage current under equiprobable inputs for CP3L, CSP3L, and TGPL is 316, 401.6, and 424.6 nA, respectively, for a minimum-ED design. As shown in Fig. 15(b), this translates into a more favorable leakage-delay tradeoff, with a 2.7× improvement in the leakage-delay product. For the minimum-ED³ design, the leakage of CP3L, CSP3L, and TGPL is 561.7, 685.7, and 832.5 nA, which translates into a 5.4× improvement in the leakage-delay³ product.
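As a quick check of the leakage figures, the following sketch pairs the CP3L and TGPL leakage currents quoted above with the minimum D–Q delays reported earlier for Fig. 13. That pairing is an assumption made here for illustration; the text does not state which delay enters the product.

```python
def leakage_delay_gain(leak_ref_na, delay_ref_ps, leak_na, delay_ps, j=1):
    """Ratio of the reference leakage * delay^j product to that of the candidate."""
    return (leak_ref_na * delay_ref_ps ** j) / (leak_na * delay_ps ** j)


# TGPL as reference versus CP3L, minimum-ED design (j = 1)
gain_ld = leakage_delay_gain(424.6, 34.6, 316.0, 17.3, j=1)
# TGPL as reference versus CP3L, minimum-ED^3 design (j = 3)
gain_ld3 = leakage_delay_gain(832.5, 24.0, 561.7, 15.6, j=3)
print(f"leakage-delay gain: {gain_ld:.2f}x, leakage-delay^3 gain: {gain_ld3:.2f}x")
```

Under this assumption the CP3L ratios land close to the 2.7× and 5.4× improvements stated in the text.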


    Fig. 15. (a) Energy-delay tradeoff. (b) Leakage-delay tradeoff.

    B. Variations and Comparison With the State of the Art

    The above measurements were repeated on 256 replicas of each version of the considered pulsed latches (over four dice). As an example, Fig. 16 reports the resulting histogram of the D–Q delay for the CP3L in its minimum-ED and minimum-ED³ versions (CSP3L histograms are very similar). The variability of these and other parameters of interest is summarized in Tables II and III for the minimum-ED and minimum-ED³ designs.

    From Tables II and III, CP3L and CSP3L have a 1.7× lower standard deviation of D–Q delay compared to TGPL in the minimum-ED design, whereas there is no significant difference in the minimum-ED³ case. The comparable or smaller variations in CP3L and CSP3L translate into a 1.4× worse variability σ/μ compared to TGPL, due to the much lower average delay of CP3L and CSP3L. This variability difference does not significantly affect the above-mentioned speed advantage of CP3L and CSP3L. Indeed, the 3-sigma worst-case value of their D–Q delay is better than the TGPL counterpart by 1.4× to 1.9×, depending on the design target. This is close to the above results in the nominal corner (1.5×–2×). Hence, CP3L and CSP3L are confirmed to be considerably faster than TGPL in the presence of variations.
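The argument above combines three statistics: standard deviation, relative variability, and the 3-sigma worst case. The sketch below shows how they interact, using purely hypothetical delay samples (the measured per-replica data are not reproduced here) and assuming that relative variability means the standard deviation divided by the mean.

```python
from statistics import mean, pstdev


def delay_statistics(delays_ps):
    """Mean, standard deviation, relative variability and 3-sigma worst case."""
    mu = mean(delays_ps)
    sigma = pstdev(delays_ps)
    return {
        "mean": mu,
        "sigma": sigma,
        "variability": sigma / mu,
        "worst_3sigma": mu + 3.0 * sigma,
    }


# Hypothetical samples in ps, chosen only to illustrate the trend
fast_latch = delay_statistics([16.0, 17.0, 18.0])
slow_latch = delay_statistics([33.0, 34.6, 36.2])

for label, stats in (("fast", fast_latch), ("slow", slow_latch)):
    print(f"{label}: sigma/mean = {stats['variability']:.3f}, "
          f"3-sigma worst case = {stats['worst_3sigma']:.1f} ps")
```

In this toy data set the faster latch has the smaller standard deviation and yet the larger relative variability, because its mean is much lower, while its 3-sigma worst-case delay remains clearly better.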

    Fig. 16. Histogram of CP3L D-Q delay for (a) minimum-ED design and (b) minimum-ED³ design (256 measurements).

    CP3L and CSP3L have approximately the same variability as TGPL in regard to setup time and leakage, from Tables II and III. On the other hand, CP3L and CSP3L have similar or 2× worse variability of CK–Q delay, compared to TGPL. From the perspective of VLSI system timing, the above-discussed D–Q delay variations are more impactful than CK–Q delay variations. Indeed, from Tables II and III, CK–Q variations are smaller than D–Q delay variations. In addition, critical paths typically go through a D–Q delay, rather than a CK–Q delay (late computations are finished during the transparency window). As expected, energy variations were found to be extremely small (1%), hence related results are omitted for brevity. From Tables II and III, CP3L and CSP3L also have 1.7×–2.6× less variation in hold time, which translates into a proportionally lower number of buffers inserted by place and route tools at the timing closure design phase.

    For completeness, the proposed class of pulsed latches was also compared to other existing topologies that cover a much wider range of applications, from very high performance to very low energy. In addition to TGPL, we thus considered STFF for its very high performance [16], TGFF for its high energy efficiency at moderate performance [17], and ACFF for its high energy efficiency at low performance targets [18]. The results of the comparison are summarized in Table IV, where data are normalized to the best, and the results from simulations are based on post-layout extraction. From this table, the proposed class of pulsed latches exhibits the lowest D–Q delay, as it is 1.5× lower than that of the very fast STFF (which is confirmed to be slightly faster than TGPL for the minimum-ED³ design). Also, CP3L and CSP3L largely improve energy efficiency at high-performance design targets (i.e., minimum ED³), compared to such high-speed topologies. Indeed, CP3L and CSP3L reduce ED³ by 2.2× and 2.8× compared to TGPL and STFF, respectively. CP3L and CSP3L exhibit a significantly better energy efficiency even compared to topologies that are typically used for moderate to low-speed design targets. More specifically, from Table IV, CP3L and CSP3L designed for minimum ED improve the energy-delay product by 1.4× compared to the energy-efficient TGFF topology, and by 1.9× compared to the ultralow energy ACFF.

    TABLE IV
    AVERAGE AND STANDARD DEVIATION OF MAIN PARAMETERS OF INTEREST (256 REPLICAS, MIN.-ED³ DESIGN)

    Summarizing, the proposed class of pulsed latches outperforms the state of the art in terms of pure performance, with D–Q delay improvements in the order of 1.5× or more. In current power-limited VLSI systems, the more exploitable advantage of CP3L and CSP3L is their high energy efficiency, as they outperform the state of the art by more than 2× when compared to topologies targeting high speed. In addition, the proposed pulsed latches exhibit a better energy efficiency (1.4×–1.9×) even when compared to topologies targeting very low energy.

    VIII. CONCLUSION

    In this paper, a new class of pulsed latches has been introduced. Its push–pull final stage and split paths in the first stage enable a significant reduction in path and parasitic effort. Measurements on a 65-nm test chip demonstrated a 1.5×–2× speed improvement compared to TGPL, which makes the proposed topologies the fastest ever reported. To the best of the authors' knowledge, this is the first time that novel latches are validated through measurements on 256 replicas. Measurements confirm the above advantages in the presence of variations.

    More importantly, the energy efficiency of the proposed pulsed latches enables a significant improvement beyond the state of the art. Indeed, a 2.3× improvement over TGPL was found in terms of ED³ product, and a 1.3× improvement in the ED product. The area penalty paid by the proposed latches is 1.15×–1.35× compared to TGPL, which is among the smallest existing latches. The proposed pulsed latches also exhibit a better energy efficiency (1.4×–1.9×) compared to state-of-the-art topologies that target ultralow energy operation.

    Finally, CP3L and CSP3L were shown to be equivalent in terms of energy and performance, hence both topologies are equally worth considering when designing highly energy-efficient systems. The choice between CP3L and CSP3L is driven by preliminary design decisions on the clocking scheme. Indeed, CP3L does not allow for sharing a pulse generator, but has lower area than CSP3L if the pulse generator is included. Hence, CP3L is preferable when only a small subset of FFs needs to be replaced by a pulsed latch (i.e., in pipeline stages that have few critical paths, as might occur in random logic). Indeed, in this case latches tend to be far from each other, hence it does not make sense to share their pulse generator. On the other hand, CSP3L is preferable in systems where a significant number of FFs need to be replaced (e.g., pipeline stages with many critical paths, as occurs in regular modules).

    APPENDIX

    In this appendix, transistor sizing for maximum speed (i.e., minimum D–Q delay) is analytically discussed for CP3L, CSP3L, and TGPL. In the following, all transistor channel widths are normalized to the minimum allowed by the technology, the PN ratio is equal to two, and capacitances are normalized to the input capacitance of a minimum symmetric inverter (about 0.3 fF at 1 V in this technology). Also, transmission gates were sized with equally sized pMOS and nMOS transistors, keeper transistors are all minimum sized to reduce dissipation, channel lengths are generally minimum, and series transistors are equally sized. Different sizing (e.g., non-minimum channel length) and PN ratio were allowed in the pulse generator to adjust the transparency window width, while ensuring pulses with symmetrical rise/fall time.

    A. Logical Effort Optimization of TGPL

    The transistor sizes of TGPL to be optimized under the above assumptions are reported in Fig. 17. From this figure, the two independent sizes $W_1$ and $W_2$ need to be optimized in the critical path.

    Fig. 17. Transistor sizes in TGPL.

    From Fig. 17, the first stage of the critical D–Q path comprises the input inverter and the subsequent transmission gate. From usual logical effort calculations, the timing of the first stage is characterized by the following logical effort parameters:

    $$g_1 = \frac{5}{3} \tag{A.1a}$$

    $$h_1 = \frac{3W_2 + 4}{3W_1} \tag{A.1b}$$

    $$p_1 = \frac{25}{9}. \tag{A.1c}$$

    Similarly, the second stage is a simple inverter and hence has parasitic delay $p_2 = 1$ and logical effort $g_2 = 1$, whereas its electrical effort is

    $$h_2 = \frac{W_L}{3W_2} \tag{A.2}$$

    where $W_L$ is the load expressed as an equivalent transistor width [22] (i.e., the transistor width such that its gate capacitance equals the load capacitance $C_L$ in Fig. 17), normalized to the minimum channel width. By setting $g_1 h_1 = g_2 h_2$, and neglecting the small contribution of the minimum-sized transistors connected to the output of the first stage (i.e., transistors with normalized width equal to 1 in Fig. 17), from (A.1)–(A.2) the optimum $W_2$ that minimizes the D–Q delay is $W_2 = (W_L W_1/5)^{1/2}$. By substituting $W_2$, (A.1)–(A.2) lead to the following minimum achievable delay:

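    $$D_{\min,\mathrm{TGPL}} = 2\sqrt{\frac{5}{3}\cdot\frac{3W_2 + 4}{3W_1}\cdot\frac{W_L}{3W_2}} + \frac{25}{9} + 1 \approx 2\sqrt{\frac{5}{9}\,\frac{W_L}{W_1}} + \frac{34}{9} = 2\sqrt{\frac{5}{3}\,\frac{C_L}{C_{in}}} + \frac{34}{9} \tag{A.3}$$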

    where we considered that the input capacitance $C_{in}$ in Fig. 17 is equal to the gate capacitance of a transistor with width $3W_1$, and the load capacitance $C_L$ is by definition the gate capacitance of a transistor with width $W_L$.
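The optimization leading to (A.3) can be checked numerically. The sketch below evaluates the two-stage delay from (A.1)–(A.2) with the minimum-sized transistors neglected, and compares it with the closed form obtained for equal stage efforts (twice the square root of the path effort, plus the parasitic delays). The values of `W1` and `WL` are illustrative only and are not taken from the paper; delays are in normalized logical-effort units.

```python
import math


def tgpl_delay(w1, w2, wl):
    """Normalized D-Q delay from (A.1)-(A.2), neglecting the minimum-sized transistors."""
    stage1 = (5.0 / 3.0) * (3.0 * w2) / (3.0 * w1)  # g1 * h1 with the "+4" term dropped
    stage2 = 1.0 * wl / (3.0 * w2)                   # g2 * h2
    return stage1 + stage2 + 25.0 / 9.0 + 1.0        # plus p1 + p2


def tgpl_optimum(w1, wl):
    """Closed-form optimum W2 and minimum delay for equal stage efforts."""
    w2_opt = math.sqrt(wl * w1 / 5.0)
    d_min = 2.0 * math.sqrt((5.0 / 9.0) * wl / w1) + 34.0 / 9.0
    return w2_opt, d_min


W1, WL = 2.0, 48.0  # illustrative values only (not taken from the paper)
w2_opt, d_min = tgpl_optimum(W1, WL)
print(f"W2,opt = {w2_opt:.2f}, D_min = {d_min:.2f}")
print(f"delay at 0.8 * W2,opt = {tgpl_delay(W1, 0.8 * w2_opt, WL):.2f}")
print(f"delay at 1.2 * W2,opt = {tgpl_delay(W1, 1.2 * w2_opt, WL):.2f}")
```

Moving $W_2$ away from the optimum in either direction increases the delay, which is the expected behavior when the two stage efforts are no longer balanced.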

    Finally, the detailed pulse generator sizing is very simple and is omitted here, as the transistors of the output NAND gate in Fig. 2 must simply be sized to ensure the targeted slope (i.e., rise/fall time) of signal CP. Commonly adopted values of the clock slope range from FO3 to FO4 [3], where FOX denotes the slope of the output waveform of an inverter loaded by X inverters of the same size. Subsequently, inverters are easily sized to obtain the targeted transparency window.

    Fig. 18. Transistor sizes in the critical path of CP3L and CSP3L.

    B. Logical Effort Optimization of CP3L and CSP3L

    The critical D–Q path of CP3L and CSP3L (they both have exactly the same path) with the related transistor sizes under the above assumptions is shown in Fig. 18. From this figure, CP3L and CSP3L have two independent D–Q paths with two stages: the first stage is a half latch (top latch for the fall path; bottom for the rise path), and the second is a transistor of the second push–pull stage (nMOS for the fall path; pMOS for the rise path).

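    For the fall path, logical effort analysis leads to

    $$g_{1,\mathrm{FALL}} = \frac{4}{3} \tag{A.4a}$$

    $$h_{1,\mathrm{FALL}} = \frac{W_2 + W_{PPR} + 1}{2W_1} \tag{A.4b}$$

    $$p_{1,\mathrm{FALL}} = \frac{4}{3} \tag{A.4c}$$

    whereas the rise path has the following parameters:

    $$g_{1,\mathrm{RISE}} = \frac{2}{3} \tag{A.5a}$$

    $$h_{1,\mathrm{RISE}} = \frac{2W_2 + W_{NPR} + 1}{W_1} \tag{A.5b}$$

    $$p_{1,\mathrm{RISE}} = \frac{2}{3}. \tag{A.5c}$$

    For the second stage, analysis for the fall path leads to

    $$g_{2,\mathrm{FALL}} = \frac{1}{3} \tag{A.6a}$$

    $$h_{2,\mathrm{FALL}} = \frac{4 + W_L}{W_2} \tag{A.6b}$$

    $$p_{2,\mathrm{FALL}} = 1 \tag{A.6c}$$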

    whereas the rise path has

    $$g_{2,\mathrm{RISE}} = \frac{2}{3} \tag{A.7a}$$

    $$h_{2,\mathrm{RISE}} = \frac{4 + W_L}{2W_2} \tag{A.7b}$$

    $$p_{2,\mathrm{RISE}} = 1. \tag{A.7c}$$


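    By imposing equal stage efforts, the minimum delay of the fall path is found to be

    $$D_{\min,\mathrm{CP3L,FALL}} = 2\sqrt{\frac{2\,(W_2 + W_{PPR} + 1)}{3W_1}\cdot\frac{4 + W_L}{3W_2}} + \frac{7}{3} \approx 2\sqrt{\frac{2C_L}{3C_{in}}} + \frac{7}{3} \tag{A.8}$$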

    where we neglected the small capacitance associated with the minimum-sized transistors in the keeper in Fig. 18 (i.e., $W_L \gg 4$) and the capacitance of the small precharge transistors (i.e., $W_2 \gg W_{PPR} + 1$). Under the same approximations

    $$D_{\min,\mathrm{CP3L,RISE}} = 2\sqrt{\frac{2\,(2W_2 + W_{NPR} + 1)}{3W_1}\cdot\frac{4 + W_L}{3W_2}} + \frac{5}{3} \approx 2\sqrt{\frac{4C_L}{3C_{in}}} + \frac{5}{3}. \tag{A.9}$$

    From the comparison of (A.8) and (A.9), the worst-case D–Q delay of CP3L and CSP3L is the rise path delay, given by (A.9).
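The comparison between the two paths can be illustrated numerically. The sketch below assumes the approximate closed forms of (A.8) and (A.9) in the usual two-stage equal-effort form (twice the square root of the path effort, plus the parasitic delays); the load ratios are illustrative values and the function names are not part of the original work.

```python
import math


def cp3l_fall_delay(cl_over_cin):
    """Approximate minimum fall-path delay (normalized units), per (A.8)."""
    return 2.0 * math.sqrt(2.0 * cl_over_cin / 3.0) + 7.0 / 3.0


def cp3l_rise_delay(cl_over_cin):
    """Approximate minimum rise-path delay (normalized units), per (A.9)."""
    return 2.0 * math.sqrt(4.0 * cl_over_cin / 3.0) + 5.0 / 3.0


# Illustrative ratios of load capacitance to input capacitance
for ratio in (2.0, 4.0, 16.0, 64.0):
    fall = cp3l_fall_delay(ratio)
    rise = cp3l_rise_delay(ratio)
    print(f"CL/Cin = {ratio:5.1f}: fall = {fall:5.2f}, rise = {rise:5.2f}")
```

For these load ratios the rise path has the larger path effort and its smaller parasitic delay does not compensate, so it sets the worst-case delay.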

    ACKNOWLEDGMENT

    The authors would like to thank the sponsors of the Berkeley Wireless Research Center, STMicroelectronics, for chip fabrication, and Prof. D. Blaauw and D. Sylvester for testing support.

    REFERENCES

    [1] S. Naffziger and G. Hammond, "The implementation of the next-generation 64b Itanium microprocessor," in Proc. IEEE ISSCC, Feb. 2002, pp. 276–504.

    [2] B. Dally, "Architectures and circuits for energy-efficient computing," in Proc. CICC, Sep. 2012, pp. 1–10.

    [3] M. Alioto, E. Consoli, and G. Palumbo, "Flip-flop energy/performance versus clock slope and impact on the clock network design," IEEE Trans. Circuits Syst., vol. 57, no. 6, pp. 1273–1286, Jun. 2010.

    [4] C. Giacomotto, N. Nedovic, and V. Oklobdzija, "The effect of the system specification on the optimal selection of clocked storage elements," IEEE J. Solid-State Circuits, vol. 42, no. 6, pp. 1392–1404, Jun. 2007.

    [5] T. Fischer, S. Arekapudi, E. Busta, C. Dietz, M. Golden, S. Hilker, A. Horiuchi, K. A. Hurd, D. Johnson, H. McIntyre, S. Naffziger, J. Vinh, J. White, and K. Wilcox, "Design solutions for the Bulldozer 32nm SOI 2-core processor module in an 8-core CPU," in IEEE ISSCC Dig. Tech. Papers, Feb. 2011, pp. 78–80.

    [6] P. Gronowski, W. Bowhill, R. Preston, M. Gowan, and R. Allmon, "High-performance microprocessor design," IEEE J. Solid-State Circuits, vol. 33, no. 5, pp. 676–686, May 1998.

    [7] D. Bailey and B. Benschneider, "Clocking design and analysis for a 600-MHz Alpha microprocessor," IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627–1633, Nov. 1998.

    [8] S. Naffziger, "High-performance processors in a power-limited world," in Proc. Symp. VLSI Circuits, Jun. 2006, pp. 93–97.

    [9] (2011). International Technology Roadmap for Semiconductors [Online]. Available: http://www.itrs.net

    [10] M. Alioto, E. Consoli, and G. Palumbo, "Analysis and comparison in the energy-delay-area domain of nanometer CMOS flip-flops: Part I–Methodology and design strategies," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 5, pp. 725–736, May 2011.

    [11] M. Alioto, E. Consoli, and G. Palumbo, "Analysis and comparison in the energy-delay-area domain of nanometer CMOS flip-flops: Part II–Results and figures of merit," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 5, pp. 737–750, May 2011.

    [12] M. Alioto, E. Consoli, and G. Palumbo, "From energy-delay metrics to constraints on the design of digital circuits," Int. J. Circuit Theory Appl., vol. 40, no. 8, pp. 815–834, Aug. 2012.

    [13] J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev, and V. De, "Comparative delay and energy of single edge-triggered and dual edge-triggered pulsed flip-flops for high-performance microprocessors," in Proc. ISLPED, Aug. 2001, pp. 147–152.

    [14] V. Stojanovic and V. Oklobdzija, "Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems," IEEE J. Solid-State Circuits, vol. 34, no. 4, pp. 536–548, Apr. 1999.

    [15] H. Partovi, "Clocked storage elements," in Design of High-Performance Microprocessor Circuits. Piscataway, NJ, USA: IEEE Press, pp. 207–234, 2001.

    [16] N. Nedovic, V. Oklobdzija, and W. Walker, "A clock skew absorbing flip-flop," in IEEE ISSCC Dig. Tech. Papers, Feb. 2003, pp. 342–497.

    [17] D. Markovic, B. Nikolic, and R. Brodersen, "Analysis and design of low-energy flip-flops," in Proc. Int. Symp. Low Power Electron. Design, Aug. 2001, pp. 52–55.

    [18] C. Teh, T. Fujita, H. Hara, and M. Hamada, "A 77% energy-saving 22-transistor single-phase-clocking D-flip-flop with adaptive-coupling configuration in 40nm CMOS," in IEEE ISSCC Dig. Tech. Papers, Feb. 2011, pp. 338–340.

    [19] E. Consoli, M. Alioto, G. Palumbo, and J. Rabaey, "Conditional push–pull pulsed latch with 726 fJ·ps energy delay product in 65nm CMOS," in IEEE ISSCC Dig. Tech. Papers, Feb. 2012, pp. 482–483.

    [20] H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama, "A 1.3GHz fifth generation SPARC64 microprocessor," in Proc. DAC, Jun. 2003, pp. 702–705.

    [21] M. Wieckowski, Y. M. Park, C. Tokunaga, D. W. Kim, Z. Foo, D. Sylvester, and D. Blaauw, "Timing yield enhancement through soft edge flip-flop based design," in Proc. CICC, Sep. 2008, pp. 543–546.

    [22] I. Sutherland, B. Sproull, and D. Harris, Logical Effort: Designing Fast CMOS Circuits. San Mateo, CA, USA: Morgan Kaufmann Publishers, 1999.

    [23] M. Alioto, E. Consoli, and G. Palumbo, "General strategies to design nanometer flip-flops in the energy-delay space," IEEE Trans. Circuits Syst., vol. 57, no. 7, pp. 1583–1596, Jul. 2010.

    [24] R. Ho, K. W. Mai, and M. A. Horowitz, "The future of wires," Proc. IEEE, vol. 89, no. 4, pp. 490–504, Apr. 2001.

    [25] T. Lang, E. Musoll, and J. Cortadella, "Individual flip-flops with gated clocks for low power datapaths," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 44, no. 6, pp. 507–516, Jun. 1997.

    [26] S. Heo and K. Asanovic, "Load-sensitive flip-flop characterization," in Proc. CSW-VLSI, Apr. 2001, pp. 87–92.

    [27] S. Heo, R. Krashinsky, and K. Asanovic, "Activity-sensitive flip-flop and latch selection for reduced energy," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 9, pp. 1060–1064, Sep. 2007.

    [28] V. Oklobdzija, V. Stojanovic, D. Markovic, and N. Nedovic, Digital System Clocking: High-Performance and Low-Power Aspects. New York, NY, USA: Wiley, 2003.

    [29] J. D. Warnock, Y. H. Chan, W. V. Huott, S. M. Carey, M. F. Fee, H. Wen, M. J. Saccamango, F. Malgioglio, P. J. Meaney, D. W. Plass, Y. Chan, M. D. Mayo, G. Mayer, L. J. Sigal, D. L. Rude, R. Averill, M. Wood, T. Strach, H. H. Smith, B. W. Curran, E. M. Schwarz, L. Eisen, D. Malone, S. Weitzel, P. K. Mak, T. J. McPherson, and C. F. Webb, "A 5.2GHz microprocessor chip for the IBM zEnterprise system," in Proc. IEEE ISSCC Dig. Tech. Papers, Feb. 2011, pp. 70–72.

    [30] H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama, "A 1.3GHz fifth generation SPARC64 microprocessor," in Proc. DAC, Jun. 2003, pp. 702–705.

    [31] N. Nedovic, W. Walker, and V. Oklobdzija, "A test circuit for measurement of clocked storage element characteristics," IEEE J. Solid-State Circuits, vol. 39, no. 8, pp. 1294–1304, Aug. 2004.

    [32] S. G. Narendra and A. Chandrakasan, Leakage in Nanometer CMOS Technologies. New York, NY, USA: Springer-Verlag, 2006.


    Elio Consoli was born in Catania, Italy, in 1983. He received the master's degree in microelectronic engineering from the University of Catania, Catania, in 2008, and the Ph.D. degree from the Department of Electrical, Electronic and Information Engineering, University of Catania, in 2012.

    He was a Visiting Scholar with the Berkeley Wireless Research Center, UC Berkeley, Berkeley, CA, USA, in 2010. In 2011, he joined Maxim Integrated, Catania DC, as a Designer of analog and mixed-signal ICs for interface, switch, and protection products. He is the co-author of several scientific papers in refereed international journals and conferences. His current research interests include clocking strategies and energy-efficient design for high-performance and low-power digital VLSI systems in nanometer CMOS technologies, as well as the definition of novel circuits and design techniques to be employed in ultra-low-power duty-cycled wireless sensor nodes.

    Gaetano Palumbo (F'07) was born in Catania, Italy, in 1964. He received the Laurea degree in electrical engineering and the Ph.D. degree from the University of Catania, Catania, in 1988 and 1993, respectively.

    Since 1993, he has conducted courses on electronic devices, electronics for digital systems, and basic electronics. In 1994, he joined the Dipartimento Elettrico Elettronico e Sistemistico, now Dipartimento di Ingegneria Elettrica Elettronica e dei Sistemi, with the University of Catania, as a Researcher, becoming an Associate Professor in 1998. Since 2000, he has been a Full Professor with the same department. He is developing some of his research activities in collaboration with STMicroelectronics of Catania. He was the co-author of three books, CMOS Current Amplifiers, Feedback Amplifiers: Theory and Design, and Model and Design of Bipolar and MOS Current-Mode Logic (CML, ECL and SCL Digital Circuits) (Kluwer Academic Publishers, 1999, 2001 and 2005), and of a textbook on electronic devices in 2005. He is the author of over 380 scientific papers in refereed international journals (more than 150) and in conferences. He is the co-author of several patents. His research has embraced digital circuits with emphasis on bipolar and MOS current-mode digital circuits, adiabatic circuits, and high-performance building blocks focused on achieving optimum speed within the constraint of low power operation. His current research interests include analog circuits with particular emphasis on feedback circuits, compensation techniques, current-mode approach, and low-voltage circuits.

    Dr. Palumbo served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I for the topics Analog Circuits and Filters and Digital Circuits and Systems from 1999 to 2001 and from 2004 to 2005. From 2006 to 2007, he served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART II. From 2008 to 2011, he served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I. In 2005, he was one of the 12 panelists in the scientific-disciplinare area 09 (industrial and information engineering) of the Committee for Evaluation of Italian Research, which has the aim of evaluating Italian research from 2001 to 2003. In 2003, he received the Darlington Award. Since 2011, he has been a member of the Board of Governors of the IEEE CAS Society.

    Jan M. Rabaey (M'83–SM'92–F'95) received the Ph.D. degree in applied sciences from Katholieke Universiteit Leuven, Leuven, Belgium.

    He joined the Faculty with the Electrical Engineering and Computer Science Department, University of California, Berkeley, CA, USA, in 1987, where he holds the Donald O. Pederson Distinguished Professorship. He is currently the Scientific Co-Director with the Berkeley Wireless Research Center, Berkeley, CA, USA, and the Director of the Berkeley Ubiquitous SwarmLab, Berkeley, CA, USA. His current research interests include the conception and implementation of next-generation integrated wireless systems.

    Prof. Rabaey has received a wide range of major awards. He is a member of the Royal Flemish Academy of Sciences and Arts of Belgium.

    Massimo Alioto (M'01–SM'07) was born in Brescia, Italy, in 1972. He received the Laurea (M.Sc.) degree in electronics engineering and the Ph.D. degree in electrical engineering from the University of Catania, Catania, Italy, in 1997 and 2001, respectively.

    He is an Associate Professor with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. He was an Associate Professor with the Department of Information Engineering, University of Siena, Siena, Italy. In 2013, he was a Visiting Scientist with Intel Labs CRL, Hillsboro, OR, USA, working on ultra-scalable microarchitectures. From 2011 to 2012, he was a Visiting Professor with the University of Michigan, Ann Arbor, MI, USA, investigating active techniques for resiliency in near-threshold processors, error-aware VLSI design for wide energy scalability, and self-powered circuits. From 2009 to 2011, he was a Visiting Professor with BWRC, University of California, Berkeley, CA, USA, investigating next-generation ultra-low power circuits and wireless nodes. In 2007, he was a Visiting Professor with EPFL, Lausanne, Switzerland. He has authored or co-authored over 180 publications in journals (60+, mostly IEEE Transactions) and conference proceedings. He is the co-author of two books, Flip-Flop Design in Nanometer CMOS: from High Speed to Low Energy (Springer, 2013) and Model and Design of Bipolar and MOS Current-Mode Logic: CML, ECL and SCL Digital Circuits (Springer, 2005). His current research interests include ultra-low power VLSI circuits, self-powered and wireless nodes, near-threshold circuits for green computing, error-aware and widely energy-scalable VLSI circuits, and circuit techniques for emerging technologies.

    Prof. Alioto was a member of the HiPEAC Network of Excellence (EU) and the MuSyC FCRP Center, USA. From 2010 to 2012, he was the Chair of the VLSI Systems and Applications Technical Committee of the IEEE Circuits and Systems Society, for which he was a Distinguished Lecturer from 2009 to 2010 and a member of the DLP Coordinating Committee from 2011 to 2012. He currently serves as an Associate Editor-in-Chief of the IEEE TRANSACTIONS ON VLSI SYSTEMS, and served as a Guest Editor of various journal special issues (including the issue on Ultra-Low Voltage Circuits and Systems for Green Computing published in 2012 in the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART II). He serves or has served as an Associate Editor of a number of journals (IEEE TRANSACTIONS ON VLSI SYSTEMS, ACM Transactions on Design Automation of Electronic Systems, IEEE TRANSACTIONS ON CAS - PART I, Microelectronics Journal, Integration The VLSI Journal, Journal of Circuits, Systems, and Computers, Journal of Low Power Electronics, and Journal of Low Power Electronics and Applications). He was a Technical Program Chair of the ICECS in 2013, NEWCAS in 2012, and ICM in 2010 conferences, and a Track Chair in a number of conferences (ICCD, ISCAS, ICECS, VLSI-SoC, APCCAS, ICM).
