Fault tolerance in FPGA-based systems
Vincenzo Rana
POLITECNICO DI MILANO
Outline
• Techniques:
    • Triple module redundancy
        • Throughput logic
        • State-machine logic
        • I/O logic
        • BRAM
    • Error detection and error correction
    • Partial reconfiguration
• Real approaches
    • SEU mitigation through dynamic partial reconfiguration
    • Run-time fault reconfiguration
• Conclusions
Triple module redundancy
Triple module redundancy (voter)
The voter can be implemented:
• with Look-Up Tables (LUTs)
• with 3-state buffers (BUFT)
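Whichever primitive the voter is mapped onto, the function it computes is a 2-out-of-3 majority. The following Python model is a minimal sketch of that function only; the name `majority_vote` and the 8-bit example values are illustrative assumptions, not taken from the slides.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three redundant words."""
    return (a & b) | (b & c) | (a & c)


# Hypothetical example: the second copy has one bit flipped by an upset.
golden = 0b1011_0010
upset = golden ^ 0b0000_1000
voted = majority_vote(golden, upset, golden)
print(f"{voted:08b}")  # prints 10110010, the upset bit is outvoted
```

Because the vote is taken bit by bit, a single corrupted copy is masked even if several of its bits are wrong.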
Throughput logic
The system will include 3 copies of:
• the module itself
• the input signals
• the output signals
No voter is needed
No single point of failure
State-machine logic
State machines strictly depend on their state
The voter has to be implemented internally
A voter has to be inserted in the system for:
• each state register
• each feedback path
This approach keeps every state machine in the correct state at all times
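The effect of voting on the feedback path can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the state machine is a hypothetical 2-bit wrap-around counter, and the names `next_state` and `tmr_step` are invented for the example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three redundant words."""
    return (a & b) | (b & c) | (a & c)


def next_state(state: int) -> int:
    """Hypothetical state machine: a 2-bit wrap-around counter."""
    return (state + 1) & 0b11


def tmr_step(regs: list) -> list:
    """One clock cycle of the triplicated state machine.

    Each copy evaluates its own voter over all three state registers
    and computes its next state from the voted value.
    """
    return [next_state(majority_vote(*regs)) for _ in regs]


regs = [0, 0, 0]
regs = tmr_step(regs)   # all three copies move to state 1
regs[1] ^= 0b10         # upset: the second copy now holds state 3
regs = tmr_step(regs)   # the voted state is still 1, so every copy moves to 2
print(regs)             # prints [2, 2, 2]
```

Without the voter in the feedback path the upset copy would keep counting from the wrong state and never resynchronize with the other two.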
I/O logic (Input)
Input pins have to be replicated in order to avoid single points of failure
If the number of required input pins exceeds the number of input pins available on the reconfigurable device:
• just a subset of the input pins can be replicated
• the system can be split across more than one FPGA
I/O logic (Output)
In order to avoid a single-point-of-failure on output pins it is necessary to implement the following circuit
BRAM
BRAMs are large blocks of static memory (4 Kbit each) that are true dual-port and fully synchronous
Techniques:
• Simple redundancy
    • Replication of BRAMs
• Redundancy and refresh
    • Replication of BRAMs
    • Refresh with voter
• Data encoding
    • Error Correction Control (ECC)
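The "redundancy and refresh" technique can be sketched as a loop that votes every address across the three memory copies and writes the voted word back. The model below is an assumption-laden illustration: the memories are plain Python lists, and the function name `refresh` and the data values are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three redundant words."""
    return (a & b) | (b & c) | (a & c)


def refresh(copies: list) -> int:
    """Vote each address across the three copies and rewrite any copy
    that disagrees with the voted word. Returns the number of rewrites."""
    corrected = 0
    for addr in range(len(copies[0])):
        voted = majority_vote(copies[0][addr], copies[1][addr], copies[2][addr])
        for mem in copies:
            if mem[addr] != voted:
                mem[addr] = voted
                corrected += 1
    return corrected


data = [0x12, 0x34, 0x56, 0x78]
copies = [list(data) for _ in range(3)]
copies[2][1] ^= 0x04          # upset in one word of the third copy
print(refresh(copies))        # prints 1
```

Refreshing periodically prevents upsets from accumulating in the same word of two different copies, which the voter alone could not mask.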
Error detection and error correction
It is more performance- and cost-effective to correct an error than to retransmit the data
Parity bits are added to the data bits (64+8 or 32+7)
No memory replication
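The 64+8 and 32+7 figures are consistent with a Hamming code extended by one overall parity bit (single error correction, double error detection). Assuming that is the scheme in use, which the slides do not state explicitly, the number of check bits can be derived as follows; the function name `secded_check_bits` is illustrative.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by an extended Hamming (SEC-DED) code.

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1
    to correct one error; one extra overall parity bit adds
    double-error detection. SEC-DED is an assumption of this sketch.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


for width in (64, 32):
    print(width, secded_check_bits(width))
# prints:
# 64 8
# 32 7
```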
Partial reconfiguration
Access to the configuration memory:
• Readback
    • Post-configuration read operation
• Partial reconfiguration
    • Post-configuration write operation
Techniques:
• SEU scrubbing
    • Partial reconfiguration
• SEU detection
    • Readback
        • Bit-for-bit comparison
        • CRC comparison
• SEU correction
    • Readback
    • Partial reconfiguration
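The detection and correction flow listed above can be sketched in software. In this model the configuration memory is a list of byte strings ("frames"), readback is a plain read of that list, and partial reconfiguration is the rewrite of a single frame. The frame size, the frame count, and the names `detect_upsets` and `correct_upsets` are assumptions made for illustration; no vendor configuration API is modeled.

```python
import zlib


def detect_upsets(readback_frames: list, golden_crcs: list) -> list:
    """Return the indices of frames whose CRC differs from the golden CRC."""
    return [i for i, frame in enumerate(readback_frames)
            if zlib.crc32(frame) != golden_crcs[i]]


def correct_upsets(config_memory: list, golden_frames: list, bad_frames: list) -> None:
    """Rewrite only the corrupted frames (models partial reconfiguration)."""
    for i in bad_frames:
        config_memory[i] = golden_frames[i]


# Hypothetical golden bitstream: 8 frames of 16 bytes each.
golden_frames = [bytes([i] * 16) for i in range(8)]
golden_crcs = [zlib.crc32(f) for f in golden_frames]
config_memory = list(golden_frames)

# Inject an SEU: flip one bit in frame 5.
frame = bytearray(config_memory[5])
frame[3] ^= 0x01
config_memory[5] = bytes(frame)

bad = detect_upsets(config_memory, golden_crcs)
print(bad)  # prints [5]
correct_upsets(config_memory, golden_frames, bad)
```

CRC comparison only needs one stored checksum per frame, whereas bit-for-bit comparison needs a full golden copy available at detection time; correction needs the golden frame data in either case.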
Dynamic partial reconfiguration
Dynamic partial reconfiguration can be used to reconfigure the affected portion of the architecture:
• while the rest of the system is still working
• without the need to perform a complete reconfiguration
It is most useful when only the smallest portion of the FPGA containing the fault is reconfigured (a good partitioning phase is needed)
Solution space exploration has to be performed
Dynamic partial reconfiguration (DWC)
Fault detection and characterization
• Identification of a mismatch
Fault localization
• Identification of the portion of the device where the fault is located
Several solutions are possible when applying duplication with comparison (DWC)
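One possible way to combine detection and localization, sketched here under the assumption that each partition of the design is duplicated and has its own comparator, is to report every region whose two copies disagree. The region names, the example modules, and the function `dwc_localize` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Module = Callable[[int], int]


def dwc_localize(regions: Dict[str, Tuple[Module, Module]],
                 stimulus: List[int]) -> List[str]:
    """Return the names of regions whose two copies produce different
    outputs for at least one stimulus value."""
    faulty = []
    for name, (copy_a, copy_b) in regions.items():
        if any(copy_a(x) != copy_b(x) for x in stimulus):
            faulty.append(name)
    return faulty


def adder(x: int) -> int:
    return (x + 1) & 0xFF


def inverter(x: int) -> int:
    return ~x & 0xFF


def faulty_inverter(x: int) -> int:
    # Models a stuck-at-1 fault on output bit 0.
    return inverter(x) | 0x01


regions = {
    "region_0": (adder, adder),
    "region_1": (inverter, faulty_inverter),
}
print(dwc_localize(regions, list(range(4))))  # prints ['region_1']
```

A mismatch shows that one of the two copies in the region is faulty but not which one, so the whole region is the candidate for reconfiguration.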
Dynamic partial reconfiguration (ro-index)
ro-index: the ratio between the occupied area and its minimal placement constraint, both computed in slices
Occupied area in Slices: So
Placement constraint in Slices: Sc
ro-index = So / Sc
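As a worked example of the ratio, the short function below evaluates the ro-index for a hypothetical module; the slice counts 150 and 200 are invented for illustration.

```python
def ro_index(occupied_slices: int, constraint_slices: int) -> float:
    """ro-index = So / Sc, with both areas expressed in slices."""
    if constraint_slices <= 0:
        raise ValueError("the placement constraint must contain at least one slice")
    return occupied_slices / constraint_slices


# Hypothetical module: So = 150 slices placed in a region of Sc = 200 slices.
print(ro_index(150, 200))  # prints 0.75
```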
Run-time fault reconfiguration
Recovery from permanent logic and interconnect faults
Fine-grained physical design partitioning
Faults are localized to small partitioned blocks that have fixed interfaces to the surrounding portion of the device
Affected blocks are reconfigured with previously generated, functionally equivalent block instances that do not use the faulty resources
Run-time fault reconfiguration
Assumptions
• Detection of a fault
• Localization of a fault
• Diagnosis of a fault (helpful, but not necessary)
Action
• An alternate configuration of the design that does not utilize the faulty resources can be loaded
Advantages
• extremely low area overhead
• very low timing overhead
• run-time management of faults
• high flexibility
Disadvantages
• very complex design phase (and run-time management)
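The run-time action can be sketched as a lookup over pre-generated block instances: given the set of resources reported as faulty, pick an instance that uses none of them. The resource names, the instance names, and the function `select_configuration` are hypothetical and stand in for whatever the design flow actually produces.

```python
from typing import Dict, FrozenSet, Optional, Set


def select_configuration(alternatives: Dict[str, FrozenSet[str]],
                         faulty: Set[str]) -> Optional[str]:
    """Return the first pre-generated instance that avoids every faulty
    resource, or None if no such instance exists."""
    for name, used_resources in alternatives.items():
        if used_resources.isdisjoint(faulty):
            return name
    return None


# Hypothetical functionally equivalent instances of one partitioned block.
alternatives = {
    "block_v0": frozenset({"slice_0", "slice_1"}),
    "block_v1": frozenset({"slice_1", "slice_2"}),
    "block_v2": frozenset({"slice_2", "slice_3"}),
}
print(select_configuration(alternatives, {"slice_0"}))  # prints block_v1
```

The lookup itself is cheap at run time; the cost sits in the design phase, where enough alternative instances must be generated to cover the faults of interest.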
Conclusions
Reliable systems can be effectively implemented on FPGA devices
The techniques presented above can be combined in order to improve the overall reliability of the design
TMR combined with SEU correction through partial reconfiguration is a powerful and effective SEU mitigation strategy
3-state buffers can be used to implement fault tolerance methodologies without wasting LUTs (keeping the area overhead low)
The end
•Thank you for your attention
•Do you have any questions?