How to Speed-up Fault-Tolerant Clock Generation in VLSI Systems-on-Chip via Pipelining


Matthias Függer (1), Andreas Dielacher (2) and Ulrich Schmid (1)

(1) Vienna University of Technology, Embedded Computing Systems Group

{fuegger, s}@ecs.tuwien.ac.at

(2) RUAG Space, Vienna

[email protected]


Outline

1. Fault-tolerant SoCs
2. Asynchronous fault-tolerant clock generation algorithm
3. Making it faster
4. Proving it correct
5. FPGA implementation

Making SoCs fault-tolerant

System Level Approach

• replication of functional units
• communication between units necessary to maintain consistency
• problems are analogous to those of replicated state machines in distributed systems!


Fault-tolerant SoC needs Common Time

precision: at any t, π(t) bounded

accuracy: l(Δ) < #ticks in any Δ < u(Δ)

[Figure: tick timelines of nodes p and q (q's local clock domain), with example values π(t) = 2 and #ticks(Δ) = 3]

Common time eases (allows) solving problems of replica determinism (atomic broadcast).
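The precision and accuracy notions above can be illustrated with a minimal sketch. It assumes that each node's ticks are given as a sorted list of timestamps and that a node's clock value at time t is the number of ticks it has generated up to t; the function names and the example timestamps are hypothetical and not taken from the slides.

```python
from bisect import bisect_right


def clock_value(tick_times, t):
    """Number of ticks generated up to and including time t (tick_times must be sorted)."""
    return bisect_right(tick_times, t)


def precision(tick_times_by_node, t):
    """Largest difference between any two nodes' clock values at time t."""
    values = [clock_value(ts, t) for ts in tick_times_by_node.values()]
    return max(values) - min(values)


def ticks_in_window(tick_times, start, end):
    """Number of ticks with start < time <= end, the quantity bounded by accuracy."""
    return bisect_right(tick_times, end) - bisect_right(tick_times, start)


# Hypothetical tick timestamps for two nodes p and q.
ticks = {
    "p": [1.0, 2.0, 3.0, 4.0, 5.0],
    "q": [1.5, 2.8, 4.1],
}
print(precision(ticks, 5.0))
print(ticks_in_window(ticks["p"], 2.0, 5.0))
```

A precision check would assert that `precision(ticks, t)` stays below a fixed bound for every t of interest; an accuracy check would compare `ticks_in_window` against the lower and upper bounds for the window length.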


Clocking a fault-tolerant SoC

classical synchronous SoC:
(-) single point of failure
(+) common time across chip (< 1 tick)

GALS:
(+) no single point of failure
(-) no common time across chip; synchronization overhead & metastability

globally coordinated clock generation: DARTS
(+) no single point of failure
(+) common time across chip (< small # of ticks)

[Figure: three SoC clocking schemes, each with functional units Fu1, Fu2, Fu3 on a data bus: a single oscillator with a clock tree; one oscillator per functional unit; and TG-Algs connected via a TG-Net]

DARTS High-level Algorithm

(1) Initially:
(2)   send tick(1) to all; clock := 1;
(3) If received tick(m) from at least f+1 remote nodes and m > clock:
(4)   send tick(clock+1), …, tick(m) to all; clock := m;
(5) If received tick(m) from at least 2f+1 remote nodes and m >= clock:
(6)   send tick(m+1) to all; clock := m+1;
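The rules above can be tried out in a small simulation. This is only a sketch under simplifying assumptions that are not part of the slides: execution proceeds in lock-step rounds, every message takes exactly one round, and the f faulty nodes simply stay silent. The names `kth_largest`, `darts_step` and `simulate_darts` are hypothetical. Since ticks are sent without gaps, "received tick(m) from node q" is modeled as "the highest tick seen from q is at least m".

```python
def kth_largest(values, k):
    """Largest m such that at least k entries of values are >= m."""
    return sorted(values, reverse=True)[k - 1]


def darts_step(clock, remote_max, f):
    """Apply rules (3)-(6) once.

    remote_max[i] is the highest tick seen so far from remote node i (0 = none).
    Returns the new clock value; the node has then sent all ticks up to it.
    """
    m = kth_largest(remote_max, f + 1)
    if m > clock:                      # rules (3)-(4): catch up
        clock = m
    m = kth_largest(remote_max, 2 * f + 1)
    if m >= clock:                     # rules (5)-(6): make progress
        clock = m + 1
    return clock


def simulate_darts(n=5, f=1, rounds=10):
    """Lock-step run; nodes n-f .. n-1 are faulty and silent (assumption)."""
    correct = range(n - f)
    clock = {p: 1 for p in correct}            # rules (1)-(2)
    highest_sent = {p: 1 for p in correct}
    for _ in range(rounds):
        arrived = dict(highest_sent)           # messages of the previous round
        for p in correct:
            remote = [arrived.get(q, 0) for q in range(n) if q != p]
            clock[p] = darts_step(clock[p], remote, f)
            highest_sent[p] = clock[p]
    return clock


print(simulate_darts())
```

In this idealized setting every correct node advances by exactly one tick per round, which corresponds to one tick per end-to-end delay.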

[Figure: example execution with n = 5, f = 1 (labels: k, k+1, TQS)]

DARTS Hardware Implementation

[Figure: node p with Counter Modules 1 to n-1, each fed by Remote Inputs (rrem, from remote clk_in) and Local Inputs (rloc); Threshold Modules (f+1, 2f+1); TickGen driving clk_out]

Each Counter Module provides the m > clock and m >= clock status.

Common time property proved in [EDCC06].

DARTS Performance

Performance

Obtained frequency: 1/Δ, depends on end-to-end delay Δ

[Figure: ticks k, k+1, k+2 spaced Δ apart in the lock-step case (Δ = 1)]

Making DARTS faster: Pipelining

Pipelined Performance

Idea: Let tick k+X+1 depend on tick k.
Obtained frequency: (X+1)/Δ, maximum depends on local delays

[Figure: ticks k, k+X+1, k+2X+2 spaced Δ apart in the lock-step case (Δ = 1), with X+1 ticks per Δ; example with X = 4]
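As a rough numerical illustration of the formula (X+1)/Δ, the sketch below plugs in the approximate delays quoted for the FPGA prototype in these slides (Δ about 125 ns, Δloc about 25 ns). The function names are hypothetical, and the cap in `largest_useful_x` rests on an assumption of this sketch, namely that ticks cannot be generated faster than one per local delay Δloc.

```python
def tick_frequency_mhz(delta_ns, x=0):
    """Tick frequency (X+1)/Δ in MHz, for an end-to-end delay Δ given in ns."""
    return (x + 1) * 1000 / delta_ns


def largest_useful_x(delta_ns, delta_loc_ns):
    """Largest X with (X+1)/Δ <= 1/Δloc (assumed cap by the local delay)."""
    return delta_ns // delta_loc_ns - 1


DELTA_NS = 125      # approximate end-to-end delay from the FPGA slide
DELTA_LOC_NS = 25   # approximate local delay from the FPGA slide

for x in range(largest_useful_x(DELTA_NS, DELTA_LOC_NS) + 1):
    print(x, tick_frequency_mhz(DELTA_NS, x))
```

With these numbers the assumed cap gives X = 4, which coincides with the maximum X reported for the prototype.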

Making DARTS faster: Algorithm Adaptations

(1) Initially:
(2)   send tick(1), ..., tick(X+1) to all; clock := X+1;
(3) If received tick(m) from at least f+1 remote nodes and m > clock:   [not changed]
(4)   send tick(clock+1), …, tick(m) to all; clock := m;
(5) If received tick(m) from at least 2f+1 remote nodes and m + X >= clock:   [allows sending k+X+1 based on k]
(6)   send tick(m+1) to all; clock := m+1;

Small change in algorithm
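The effect of the relaxed condition m + X >= clock can be explored with a discrete-time sketch. Several things here are assumptions of the sketch and not statements from the slides: time advances in unit steps, each node evaluates its rules once per step (standing in for the local delay), every message takes `delay` steps, and faulty nodes are silent. Rule (6) is read as advancing the clock by one (clock := clock+1) so that the clock never moves backwards; for X = 0 this coincides with m+1, because the catch-up rule has already ensured m <= clock. All function names are hypothetical.

```python
def kth_largest(values, k):
    """Largest m such that at least k entries of values are >= m."""
    return sorted(values, reverse=True)[k - 1]


def simulate_pdarts(x, delay, steps, n=5, f=1):
    """Return the clocks of the correct nodes after `steps` time steps."""
    correct = list(range(n - f))           # remaining f nodes are silent
    clock = {p: x + 1 for p in correct}    # rules (1)-(2)
    # history[t][q] = highest tick node q has sent by time t
    history = [dict(clock)]
    for t in range(1, steps + 1):
        arrived = history[t - delay] if t >= delay else {}
        new_clock = {}
        for p in correct:
            remote = [arrived.get(q, 0) for q in range(n) if q != p]
            c = clock[p]
            m = kth_largest(remote, f + 1)
            if m > c:                      # rules (3)-(4): catch up
                c = m
            m = kth_largest(remote, 2 * f + 1)
            if m >= 1 and m + x >= c:      # rule (5) with pipeline depth X
                c = c + 1                  # assumed reading of rule (6)
            new_clock[p] = c
        clock = new_clock
        history.append(dict(clock))
    return clock


print(simulate_pdarts(x=0, delay=5, steps=50))
print(simulate_pdarts(x=4, delay=5, steps=50))
```

With a delay of 5 steps, X = 0 yields one new tick every 5 steps, while X = 4 yields one new tick per step once the first messages have arrived, in line with the (X+1)/Δ frequency stated for pipelining.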

Is pDARTS correct?!

[Figure: example execution with n = 5, f = 1 (labels: k-X, k+1, TQS)]

Easy to prove in classical systems (synchronous, Θ-Model)

pDARTS Hardware Implementation

[Figure: node p with Counter Modules 1 to n-1, each fed by Remote Inputs (rrem, from remote clk_in) and Local Inputs (rloc); Threshold Modules (f+1, 2f+1); TickGen driving clk_out]

Each Counter Module provides the m > clock and m + X >= clock status.

pDARTS Hardware Implementation

[Figure: gate-level structure of Counter Modules 1 to 3f+1 and Tick Generation. Each Counter Module contains a Remote Pipe (input Rremote,in) and a Local Pipe (input Rlocal,in) built from stages labeled C with a Ctop stage, a Diff-Gate, and Pipe Compare Signal Generators: GEQ (NAND2, NOR2, NOR1, NAND4; outputs GEQe, GEQo) and GR (NOR2, NOR1, NAND3, NAND5; outputs GRe, GRo). Threshold Gates (≥f+1, ≥2f+1) over the 3f+1 modules feed Tick Generation, which drives LocalClk; the modules are fed by RemoteClk and LocalClk_self.]

Provides the m + X >= clock status.

Is pDARTS still correct?!

Correctness Proof

• High-level algorithm: yes. (proof-gap)
• Low-level pDARTS: far more complex proofs than DARTS, since queuing effects inside the Counter Modules cannot be neglected. Hence: build a formal framework tied to the hardware and prove it correct therein.

[Figure: Counter Module and Tick Generation circuit]

The formal Framework

Ingredients

• Classical models: step-based (state machines)
• Modules with signal ports
• Signal's behavior specified by
    • event trace: (t,x) in S
    • status function: S(t) = x
    • counting function: #S(t) = k
• Basic/Compound modules
• their behavior is specified by relations on the port behavior

[Figure: basic module with input port I and output port O, delay within [Δ-, Δ+], initially 0]
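The three views of a signal listed above can be sketched as a small data structure. This is a minimal illustration, assuming that the status function returns the value of the most recent event at or before t (with an initial value of 0) and that the counting function returns the number of events up to and including t. The class name `Signal` and its methods are hypothetical and not part of the authors' framework.

```python
from bisect import bisect_right


class Signal:
    """A signal described by its event trace, a list of (time, value) pairs."""

    def __init__(self, events, initial=0):
        self.events = sorted(events)                 # event trace: (t, x) in S
        self.times = [t for t, _ in self.events]
        self.initial = initial

    def status(self, t):
        """Status function S(t): value after the last event at or before t."""
        i = bisect_right(self.times, t)
        return self.initial if i == 0 else self.events[i - 1][1]

    def count(self, t):
        """Counting function #S(t): number of events up to and including t."""
        return bisect_right(self.times, t)


s = Signal([(1.0, 1), (2.5, 0), (4.0, 1)])
print(s.status(3.0), s.count(3.0))
```

Module behavior can then be phrased as relations between such functions on the input and output ports, for example bounding when the output count catches up with the input count.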

The formal Framework


Diff-Gate Module (Counting Function model)

When to remove tick k from both local and remote pipe:

[Figure: Counter Module circuit with the Diff-Gate placed between the Local Pipe and the Remote Pipe]

For k = 0: If
(1) received tick 1 in remote pipe at t, and
(2) received tick 1 in local pipe at t',
remove tick 0 from both pipes within max(t, t') + [Δ-_Diff, Δ+_Diff]

For k > 0: If
(1) received tick k+1 in remote pipe at t,
(2) received tick k+1 in local pipe at t', and
(3) removed tick k-1 at t'',
remove tick k from both pipes within max(t, t', t'') + [Δ-_Diff, Δ+_Diff]

Active signal only if exactly 1 tick in local pipe
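The removal rule of the Diff-Gate can be turned into a short calculation of the time window in which each tick is removed. The sketch below assumes that the arrival times of ticks 0, 1, 2, ... in each pipe are known as lists, and it propagates the earliest and latest possible removal times separately; the function name and the example numbers are hypothetical.

```python
def removal_windows(remote_arrivals, local_arrivals, d_min, d_max):
    """Return a list of (earliest, latest) removal times, one per tick k.

    remote_arrivals[k] and local_arrivals[k] are the times at which tick k
    entered the remote and the local pipe; d_min and d_max stand for the
    Diff-Gate delay bounds.
    """
    windows = []
    available = min(len(remote_arrivals), len(local_arrivals))
    for k in range(available - 1):
        # Conditions (1) and (2): tick k+1 is present in both pipes.
        ready = max(remote_arrivals[k + 1], local_arrivals[k + 1])
        if k == 0:
            earliest, latest = ready + d_min, ready + d_max
        else:
            # Condition (3): tick k-1 must have been removed first.
            prev_earliest, prev_latest = windows[-1]
            earliest = max(ready, prev_earliest) + d_min
            latest = max(ready, prev_latest) + d_max
        windows.append((earliest, latest))
    return windows


# Hypothetical arrival times and delay bounds (arbitrary time units).
print(removal_windows([0, 10, 12, 30], [0, 8, 11, 13], d_min=1, d_max=3))
```

Because the maximum is monotone in each argument, using the earliest previous removal for the lower bound and the latest one for the upper bound gives a valid enclosing window for every tick.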

Proof Results

Precision

Accuracy: L(t2-t1) ≤ #ticks in any (t2-t1) ≤ U(t2-t1)

Bounded Queue Sizes: depends on

FPGA prototype implementation


X = 0 (conventional DARTS)

maximum of X = 4 (stabilizes)

APEX EP20K1000 FPGA

Slow Δ compared to Δloc:
Δ about 125 ns, Δloc about 25 ns

Conclusions

• Replication to make SoCs fault-tolerant.
• Clocking a replicated state machine is non-trivial, but possible.
• Unfortunately: slow!
• Apply pipelining idea to make it faster.
• Formal analysis with hardware-inspired formal framework.
• Proved it correct & implemented FPGA prototype.


Spreading effect of Ticks

[Figure: Counter Module and Tick Generation circuit]

Tends to spread out ticks evenly after an initial phase.
