7. Fault Tolerance Through Dynamic or Standby Redundancy
7.5 Forward Recovery Systems
Upon the detection of a failure, the system discards the current erroneous state and determines the correct state without any loss of computation.
There are two different approaches:
a) Hardware Redundancy
   – Static Redundancy
   – Dynamic Redundancy
b) Software Redundancy
E.g. in a given system, each subsystem is duplicated, forming a pair. One of the replicas is identified as the spare. Each subsystem and its spare are themselves made self-checking by replication, so the HW is replicated 4 times. All 4 copies of the HW are tightly synchronized. When a fault is detected in a subsystem by its self-checking mechanism, the subsystem disconnects itself and the spare starts providing its service without any interruption or rollback.
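The pair-and-spare arrangement described above can be sketched in a few lines. This is only an illustrative model, not the slides' implementation: the names `SelfCheckingUnit` and `PairAndSpare` are hypothetical, each hardware copy is stood in for by a plain Python function, and tight synchronization is modeled by having both units compute every step.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SelfCheckingUnit:
    """Two copies of the same function; a mismatch disconnects the unit."""
    copy_a: Callable[[int], int]
    copy_b: Callable[[int], int]
    connected: bool = True

    def run(self, x: int) -> Optional[int]:
        if not self.connected:
            return None
        out_a, out_b = self.copy_a(x), self.copy_b(x)
        if out_a != out_b:
            # Self-check failed: disconnect and deliver nothing.
            self.connected = False
            return None
        return out_a


class PairAndSpare:
    """A self-checking primary backed by a self-checking spare (4 copies in total)."""

    def __init__(self, primary: SelfCheckingUnit, spare: SelfCheckingUnit):
        self.primary = primary
        self.spare = spare

    def run(self, x: int) -> int:
        # Both units execute every step, so the spare's result is already
        # available when the primary disconnects: no rollback is needed.
        primary_out = self.primary.run(x)
        spare_out = self.spare.run(x)
        if primary_out is not None:
            return primary_out
        if spare_out is not None:
            return spare_out
        raise RuntimeError("both the primary and the spare have failed")
```

Because the spare has already computed the same step, the caller of `run` sees a valid result on the very step in which the primary disconnects.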
E.g. the reconfigurable duplication mechanism, where the process is replicated on 2 processors. Their outputs are continuously compared. If a mismatch is detected, indicating a failure of at least one processor in the pair, each processor runs self-diagnostic tasks to determine whether it has failed. Once the faulty processor is identified, the output of the fault-free processor can be accepted as correct.
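A minimal sketch of this compare-then-diagnose flow is given below. It rests on assumptions not in the slides: the replicated process is modeled as a function that squares its input, the self-diagnostic task is modeled as a known-answer test over the hypothetical `SELF_TEST_VECTORS`, and the names `Processor` and `duplex_step` are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical known-answer vectors for the self-diagnostic task (input, expected).
SELF_TEST_VECTORS: List[Tuple[int, int]] = [(0, 0), (1, 1), (5, 25)]


@dataclass
class Processor:
    name: str
    compute: Callable[[int], int]  # stands in for the replicated process

    def self_diagnose(self) -> bool:
        """Return True if every known-answer test passes."""
        return all(self.compute(x) == expected for x, expected in SELF_TEST_VECTORS)


def duplex_step(p1: Processor, p2: Processor, x: int) -> Tuple[int, List[str]]:
    """Run one step on both processors; return (accepted output, faulty names)."""
    out1, out2 = p1.compute(x), p2.compute(x)
    if out1 == out2:
        return out1, []
    # Mismatch: at least one processor has failed. Diagnostics run only now,
    # which is where the extra computation overhead comes from.
    ok1, ok2 = p1.self_diagnose(), p2.self_diagnose()
    if ok1 and not ok2:
        return out1, [p2.name]
    if ok2 and not ok1:
        return out2, [p1.name]
    raise RuntimeError("diagnostics could not isolate a single faulty processor")
```

A known-answer test only catches faults that its vectors happen to exercise, so the final `RuntimeError` branch covers the case where diagnostics cannot tell the two processors apart.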
The use of self-diagnostic tasks instead of concurrent self-checking results in a slight computation overhead for determining the faulty processor after a fault is detected.
Forward recovery schemes based on dynamic redundancy and checkpointing try to avoid rollback even in the presence of failures. The fault is thus tolerated without the performance penalty of a rollback.
E.g. consider a duplex system that detects failures by periodically checkpointing its two modules and then comparing their states.
When a failure is detected, the roll-forward checkpointing scheme tries to determine which of the two processing modules, if any, is fault-free.
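One way to picture the roll-forward decision is the sketch below. It assumes, beyond what the text above states, that a spare module re-executes the last interval from the previous agreed checkpoint and that its result is used as the reference. The function name `checkpoint_and_recover` and the integer "state" are illustrative only.

```python
from typing import Callable, Tuple

# One checkpoint interval of work: maps the state at checkpoint j-1
# to the state at checkpoint j.
Interval = Callable[[int], int]


def checkpoint_and_recover(
    last_good: int,
    run_a: Interval,
    run_b: Interval,
    run_spare: Interval,
) -> Tuple[int, str]:
    """Compare module states at a checkpoint and pick a recovery action."""
    state_a = run_a(last_good)
    state_b = run_b(last_good)
    if state_a == state_b:
        return state_a, "commit"

    # Mismatch: re-execute the interval on the spare from the last agreed
    # checkpoint and use its state as the reference (assumed mechanism).
    reference = run_spare(last_good)
    if reference == state_a:
        return state_a, "roll-forward: B restored from A"
    if reference == state_b:
        return state_b, "roll-forward: A restored from B"
    # Neither module agrees with the spare: fall back to the old checkpoint.
    return last_good, "rollback"
```

The sketch runs sequentially and so leaves out the part that saves time. In a real roll-forward scheme the two modules would typically keep executing while the spare retries the interval.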
Variations of the roll-forward checkpointing scheme (RFCS) may assume that each module has built-in fault detection capability, such as parity checks or exception detection. Thus, 4 different scenarios can be conceptualized:
Optimistic (only single faults): Roll-forward (I), Roll-forward (I), Rollback (I)*
Pessimistic (double faults may occur): Roll-forward (II), Rollback (II)
Three Different Recovery Schemes (* no built-in fault detection capability included).
In the pessimistic recovery strategy, note that although module B was already suspected to be faulty, a more conservative action was taken in case A had experienced a failure that escaped the built-in detection capability during I1.
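The difference between the two strategies when built-in detection flags exactly one module can be sketched as follows. This is one plausible reading of the slides, not a definitive description: the assumption is that the optimistic strategy trusts the unflagged module outright, while the pessimistic strategy first verifies it against a spare. The names `Strategy` and `recovery_steps` are hypothetical.

```python
from enum import Enum
from typing import List


class Strategy(Enum):
    OPTIMISTIC = "optimistic"    # assumes only single faults
    PESSIMISTIC = "pessimistic"  # allows for double faults


def recovery_steps(strategy: Strategy, suspect: str, other: str) -> List[str]:
    """List the actions taken when built-in detection flags `suspect` only."""
    if strategy is Strategy.OPTIMISTIC:
        # Trust the unflagged module immediately (assumed behaviour).
        return [
            f"copy state of {other} to {suspect}",
            "continue without rollback",
        ]
    # Pessimistic: the unflagged module may hide an undetected failure,
    # so it is checked against a spare before its state is reused.
    return [
        "re-run interval on spare",
        f"compare spare with {other}",
        f"if they match: copy state of {other} to {suspect}",
        "otherwise: roll back",
    ]
```

For the case described above, `recovery_steps(Strategy.PESSIMISTIC, "B", "A")` lists the extra verification of A that the optimistic strategy skips.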
Pessimistic Scheme with spare rolling forward with all single faults.
Pessimistic Scheme with spare rolling back with double faults.
Curve 1, the ideal curve, is preferred because it allows a small reduction in reliability to be traded off against a large gain in performance. (This is the case for optimistic recovery strategies.)
Generally, the mean completion time, given that a failure has occurred, is lower for the roll-forward scheme than for the corresponding rollback scheme, under both the optimistic and the pessimistic strategies.
Without any failure, all the schemes perform similarly.
When there is no built-in detection capability, the pessimistic and the corresponding optimistic scheme have identical reliabilities. Since there is no built-in detection, there is no way to identify the faulty module without comparison between operating modules and the spare one.
When fault detection coverage is 100%, the schemes with and without a spare have identical reliabilities.