3D-DRESD FT

POLITECNICO DI MILANO

Vincenzo Rana

[email protected]

Fault tolerance in FPGA-based systems


Outline

Techniques:
• Triple module redundancy
  • Throughput logic
  • State-machine logic
  • I/O logic
  • BRAM
• Error detection and error correction
• Partial reconfiguration

Real approaches
• SEU mitigation through dynamic partial reconfiguration
• Run-time fault reconfiguration

Conclusions


Triple module redundancy


Triple module redundancy (voter)

The voter can be implemented:
• with Look-Up Tables (LUTs)
• with 3-state buffers (BUFT)
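
The voting function is a bitwise 2-out-of-3 majority. The following is a minimal software sketch of that logic equation, not of the BUFT circuit; the name majority_vote and the 8-bit example value are illustrative assumptions and do not come from the slides.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


# Three copies of the same 8-bit module output; the second copy has one
# flipped bit (hypothetical single upset).
good = 0b1011_0010
outputs = (good, good ^ 0b0000_1000, good)
voted = majority_vote(*outputs)  # the upset bit is outvoted
```

Because the vote is taken bit by bit, a single faulty copy is masked regardless of which bit it corrupts.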


Throughput logic

The system will include 3 copies of:
• the module itself
• the input signals
• the output signals

No voter is needed

No single-point-of-failure


State-machine logic

State machines strictly depend on their state
The voter has to be implemented internally

A voter has to be inserted in the system for:
• each state register
• each feedback path

This approach keeps each state machine in the correct state at all times
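
The effect of voting inside the feedback path can be sketched with a small simulation. This is only an illustration under stated assumptions: the 3-bit counter used as next-state logic and the names vote, next_state and tmr_step are hypothetical, and each copy is modelled as voting on the three state registers before computing its next state.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority."""
    return (a & b) | (a & c) | (b & c)


def next_state(state: int) -> int:
    """Hypothetical next-state logic: a 3-bit counter."""
    return (state + 1) & 0b111


def tmr_step(regs: list) -> list:
    """One clock cycle: every copy votes on the three state registers and
    feeds the voted value into its own next-state logic."""
    return [next_state(vote(*regs)) for _ in regs]


regs = [0, 0, 0]
regs = tmr_step(regs)          # all three copies now hold state 1
regs[1] ^= 0b100               # an upset flips one bit in the second register
voted_after_upset = vote(*regs)
regs = tmr_step(regs)          # the upset copy is pulled back in step
```

After the step that follows the upset, all three registers agree again, so the fault does not persist in the state.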


I/O logic (Input)

Input pins have to be replicated in order to avoid single points of failure

If the number of required input pins exceeds the number of input pins available on the reconfigurable device:
• Just a subset of input pins can be replicated
• The system can be split across more than one FPGA


I/O logic (Output)

In order to avoid a single-point-of-failure on output pins it is necessary to implement the following circuit


BRAM

BRAMs are large blocks of static memory (4K bits each) that are true dual-port and fully synchronous

Techniques:

Simple redundancy
• Replication of BRAMs

Redundancy and refresh
• Replication of BRAMs
• Refresh with voter

Data encoding
• Error Correction Control (ECC)
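
The "redundancy and refresh" technique can be sketched in software as three memory copies that are periodically rewritten with their voted contents, so that upsets in different copies do not accumulate over time. The class name TripleMemory, the memory depth and the stored value are illustrative assumptions, not taken from the slides.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority."""
    return (a & b) | (a & c) | (b & c)


class TripleMemory:
    """Three replicated memories with voted reads and a refresh pass."""

    def __init__(self, depth: int):
        self.copies = [[0] * depth for _ in range(3)]

    def write(self, addr: int, value: int) -> None:
        for copy in self.copies:
            copy[addr] = value

    def read(self, addr: int) -> int:
        return vote(*(copy[addr] for copy in self.copies))

    def refresh(self) -> int:
        """Rewrite every word with its voted value; return the number of
        words that had to be repaired."""
        repaired = 0
        for addr in range(len(self.copies[0])):
            good = self.read(addr)
            for copy in self.copies:
                if copy[addr] != good:
                    copy[addr] = good
                    repaired += 1
        return repaired


mem = TripleMemory(depth=4)
mem.write(2, 0xA5)
mem.copies[0][2] ^= 0x01           # upset in one copy
value_read = mem.read(2)           # the voted read still returns 0xA5
repaired_words = mem.refresh()     # the refresh rewrites the upset word
```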


Error detection and error correction

It is more performance- and cost-effective to correct an error than to retransmit the data
Parity bits are added to the data bits (64+8 or 32+7)
No memory replication
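
The 64+8 and 32+7 figures match the sizes of an extended Hamming code (single error correction, double error detection). The sketch below assumes that this is the kind of code meant; the bit layout and the function names are illustrative choices, not the layout of any specific ECC core.

```python
def parity_bit_count(data_bits: int) -> int:
    """Smallest r such that 2**r >= data_bits + r + 1."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def encode(data: int, data_bits: int) -> list:
    """Return the codeword as a list of bits. Index 0 holds the overall
    parity bit; indices 1..n are the usual Hamming positions."""
    r = parity_bit_count(data_bits)
    n = data_bits + r
    code = [0] * (n + 1)
    bit = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):                  # not a power of two: data bit
            code[pos] = (data >> bit) & 1
            bit += 1
    for i in range(r):                       # Hamming parity bits
        p = 1 << i
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                code[p] ^= code[pos]
    code[0] = sum(code[1:]) & 1              # overall parity for double-error detection
    return code


def decode(code: list, data_bits: int):
    """Return (data, status); data is None when the error cannot be corrected."""
    n = data_bits + parity_bit_count(data_bits)
    code = list(code)
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) & 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        if syndrome > n:
            return None, "uncorrectable"
        code[syndrome] ^= 1                  # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        return None, "double error"
    data, bit = 0, 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):
            data |= code[pos] << bit
            bit += 1
    return data, status


word = encode(0xDEADBEEF, 32)                # 32 data bits + 7 check bits
single = list(word)
single[17] ^= 1                              # one flipped bit is corrected
double = list(word)
double[3] ^= 1
double[20] ^= 1                              # two flipped bits are only detected
```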


Partial reconfiguration

Access to the configuration memory:
• Readback
  • Post-configuration read operation
• Partial reconfiguration
  • Post-configuration write operation

Techniques:
• SEU scrubbing
  • Partial reconfiguration
• SEU detection
  • Readback
    • Bit for bit comparison
    • CRC comparison
• SEU correction
  • Readback
  • Partial reconfiguration
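
Detection by readback with CRC comparison, followed by correction through a partial rewrite, can be sketched as below. The configuration memory is modelled as a list of byte frames, which is an assumption made for illustration; the frame size, the names golden_crcs and scrub, and the use of CRC-32 from zlib are likewise illustrative and do not describe a real device interface.

```python
import zlib


def golden_crcs(frames):
    """CRC of every frame of the known-good configuration."""
    return [zlib.crc32(frame) for frame in frames]


def scrub(config_memory, golden_frames, crcs):
    """Read back each frame, compare its CRC with the stored one, and
    rewrite only the frames that differ. Returns the repaired frame indices."""
    repaired = []
    for i, frame in enumerate(config_memory):
        if zlib.crc32(bytes(frame)) != crcs[i]:               # detection
            config_memory[i] = bytearray(golden_frames[i])    # correction
            repaired.append(i)
    return repaired


golden_frames = [bytes([i] * 16) for i in range(4)]       # hypothetical frames
crcs = golden_crcs(golden_frames)
config_memory = [bytearray(f) for f in golden_frames]
config_memory[2][5] ^= 0x10                               # upset in frame 2
repaired_frames = scrub(config_memory, golden_frames, crcs)
```

A bit for bit comparison would replace the CRC check with a direct comparison against the golden frame, at the cost of keeping the full reference data available for every check.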


Dynamic partial reconfiguration

Dynamic partial reconfiguration can be useful to trigger the reconfiguration of the affected portion of the architecture:
• while the rest of the system is still working
• without the need to perform a complete reconfiguration

It can be very useful to reconfigure the smallest portion of the FPGA where the fault is located (a good partitioning phase is needed)

Solution space exploration has to be performed


Dynamic partial reconfiguration (DWC)

Fault detection and characterization
• Identification of a mismatch

Fault localization
• Identification of the portion of the device where the fault is located

Several solutions are possible when applying DWC (duplication with comparison)
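
Detection and localization with DWC can be sketched as follows: each partition is implemented twice, the two outputs are compared, and a mismatch points to the partition that has to be reconfigured. The partition names and the functions used as module copies are hypothetical. Note that a comparison alone shows that the two copies disagree, not which of them is wrong.

```python
def dwc_check(partitions, inputs):
    """Run both copies of every partition on the same inputs and return
    the names of the partitions whose outputs mismatch."""
    faulty = []
    for name, (copy_a, copy_b) in partitions.items():
        if copy_a(inputs) != copy_b(inputs):
            faulty.append(name)
    return faulty


# Hypothetical design split into two partitions; the second copy of the
# "shifter" partition models a fault that flips one output bit.
partitions = {
    "adder": (lambda x: x + 1, lambda x: x + 1),
    "shifter": (lambda x: x << 1, lambda x: (x << 1) ^ 0b100),
}
mismatching = dwc_check(partitions, 3)
```

The finer the partitioning, the smaller the area that has to be reconfigured after a mismatch, at the price of more comparators.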


Dynamic partial reconfiguration (ro-index)

ro-index: the ratio between the occupied area and its minimal placement constraint, both computed in slices

Occupied area in Slices: So

Placement constraint in Slices: Sc

ro-index = So / Sc
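
A small worked example of the ratio; the slice counts are made-up values used only to show the arithmetic, and the function name ro_index is an illustrative choice.

```python
def ro_index(occupied_slices: int, constraint_slices: int) -> float:
    """ro-index = So / Sc, both expressed in slices."""
    if constraint_slices <= 0:
        raise ValueError("the placement constraint must contain at least one slice")
    return occupied_slices / constraint_slices


# Hypothetical module: So = 120 occupied slices inside a placement
# constraint of Sc = 160 slices.
example = ro_index(120, 160)   # 0.75
```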


Run-time fault reconfiguration

Recovery from permanent logic and interconnect faults

Fine-grained physical design partitioning

Faults are localized to small partitioned blocks that have fixed interfaces to the surrounding portion of the device

Affected blocks are reconfigured with previously generated, functionally equivalent block instances that do not use the faulty resources


Run-time fault reconfiguration

Assumptions
• Detection of a fault
• Localization of a fault
• Diagnosis of a fault (just helpful, not necessary)

Action
• An alternate configuration of the design can be loaded that does not utilize the faulty resources

Advantages
• extremely low area overhead
• very low timing overhead
• run-time management of faults
• high flexibility

Disadvantages
• very complex design phase (and run-time management)
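
The action step can be sketched as a lookup among the previously generated block instances: the first instance whose resources do not overlap the faulty ones is selected for loading. The instance names, the resource names and the function select_configuration are hypothetical and only illustrate the selection logic.

```python
def select_configuration(instances, faulty_resources):
    """Return the name of the first pre-generated instance that avoids
    every faulty resource, or None if no such instance exists."""
    for name, used_resources in instances.items():
        if used_resources.isdisjoint(faulty_resources):
            return name
    return None


# Hypothetical, functionally equivalent instances of one block, each
# placed on a different set of slices.
instances = {
    "block_v0": {"slice_0", "slice_1"},
    "block_v1": {"slice_1", "slice_2"},
    "block_v2": {"slice_2", "slice_3"},
}
chosen = select_configuration(instances, {"slice_1"})
```

Generating and storing these alternatives in advance is part of the design-phase complexity listed among the disadvantages.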


Conclusions

Reliable systems can be effectively implemented on FPGA devices

The techniques presented can be combined in order to improve the overall reliability of the whole design

TMR combined with SEU correction through partial reconfiguration is a powerful and effective SEU mitigation strategy

3-state buffers can be used to implement fault tolerance methodologies without wasting LUTs, keeping the area overhead low


The end

•Thank you for your attention

•Do you have any questions?
