
7. Fault Tolerance Through Dynamic or Standby Redundancy

7.5 Forward Recovery Systems

Upon the detection of a failure, the system discards the current erroneous state and determines the correct state without any loss of computation.

There are two different approaches:

a) Hardware Redundancy

   – Static Redundancy

   – Dynamic Redundancy

b) Software Redundancy


7.5.1 Static Redundancy Approaches

There are 3 different approaches to masking failures:

Active Masking Redundancy

Active Redundancy Using Fail-Stop Modules

Active Redundancy Using Self-Diagnosis



Active Masking Redundancy:

Uses an adequate level of replication to tolerate failures, with voting on the outputs of all replicas.

E.g.: TMR (Triple Modular Redundant) systems mask a single failure without any performance loss.
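The voting step can be sketched as follows. This is a minimal illustration, not a description of any particular TMR hardware: the replica functions and the injected fault are hypothetical, and the voter simply takes the majority of three outputs.

```python
from collections import Counter

def tmr_vote(outputs):
    """Return the majority value among three replica outputs."""
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three replica outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # More than one replica disagrees: a single voter cannot mask this.
        raise RuntimeError("no majority among the replicas")
    return value

def replica_ok(x):
    return x * x

def replica_faulty(x):
    # Hypothetical faulty replica, used only to exercise the voter.
    return x * x + 1

x = 7
voted = tmr_vote([replica_ok(x), replica_faulty(x), replica_ok(x)])
print(voted)
```

The faulty output is outvoted on the spot, so no retry or rollback is needed, which is why a single failure costs no performance.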


Active Redundancy Using Fail-Stop Modules:

Multiple processors actively execute each process. Each processor is itself assumed to be fail-stop. Thus, if one of the processors fails, it stops executing, and the other processors executing the task continue functioning without any performance penalty, even in the presence of failures.



E.g. in a given system, each subsystem is duplicated, forming a pair. One of the replicas is identified as the spare. Each subsystem and its spare are themselves made self-checking by replication. The HW is thereby replicated 4 times. All 4 copies of the HW are tightly synchronized. When a fault is detected in a subsystem by its self-checking mechanisms, the subsystem disconnects itself and the spare starts providing its service without any interruption or rollback.
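The pair-and-spare arrangement described above can be sketched in a few lines. The class names, the injected fault, and the convention that a stopped unit returns None are all assumptions made for illustration; real systems do this in synchronized hardware.

```python
class SelfCheckingUnit:
    """Two lock-stepped copies; any mismatch makes the unit fail-stop."""

    def __init__(self, copy_a, copy_b):
        self.copies = (copy_a, copy_b)
        self.stopped = False

    def step(self, x):
        # Returns None once stopped (assumes real outputs are never None).
        if self.stopped:
            return None
        a, b = (f(x) for f in self.copies)
        if a != b:
            self.stopped = True  # the unit disconnects itself
            return None
        return a

class PairAndSpare:
    """Active unit plus a hot spare; four copies of the logic in total."""

    def __init__(self, active, spare):
        self.active, self.spare = active, spare

    def step(self, x):
        # Both units execute every step, so the spare is always up to date.
        out_active = self.active.step(x)
        out_spare = self.spare.step(x)
        if out_active is None:
            if out_spare is None:
                raise RuntimeError("both units have stopped")
            self.active, self.spare = self.spare, self.active
            return out_spare  # same step, no interruption or rollback
        return out_active

def good(x):
    return 2 * x

def flaky(x):
    # Hypothetical fault: wrong result for one particular input.
    return 2 * x + (1 if x == 3 else 0)

system = PairAndSpare(SelfCheckingUnit(good, flaky), SelfCheckingUnit(good, good))
outputs = [system.step(x) for x in range(1, 6)]
print(outputs)
```

Because the spare has executed every step alongside the active unit, its output for the failing step is already available when the switch happens.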



Active Redundancy Using Self-Diagnosis:

Analogous to the approach using “fail-stop modules”; however, instead of a concurrent self-checking mechanism, self-diagnosis tasks are used to identify the faulty processor.



E.g. the reconfigurable duplication mechanism, where the process is replicated on 2 processors. Their outputs are continuously compared. If a mismatch is detected, indicating a failure of at least one of the processors in the pair, each processor runs self-diagnostic tasks to determine whether it has failed. Once the faulty processor is identified, the output of the fault-free processor can be accepted as correct.


The use of self-diagnostic tasks instead of concurrent self-checking results in a slight computation overhead for determining the faulty processor after a fault is detected.
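A minimal sketch of reconfigurable duplication follows. The diagnostic here is a hypothetical known-answer test, and the injected fault is assumed to persist long enough for that test to catch it; neither detail comes from the slides.

```python
# Hypothetical diagnostic vectors: inputs with known correct answers.
KNOWN_ANSWERS = [(0, 0), (2, 4), (5, 25)]

class Processor:
    def __init__(self, name, compute):
        self.name = name
        self.compute = compute

    def self_diagnose(self):
        """Run the diagnostic task; True means the processor looks healthy."""
        return all(self.compute(x) == expected for x, expected in KNOWN_ANSWERS)

def duplex_step(p1, p2, x):
    """Return (accepted output, name of the faulty processor or None)."""
    out1, out2 = p1.compute(x), p2.compute(x)
    if out1 == out2:
        return out1, None  # normal case: comparison only, no diagnosis
    # Mismatch: diagnosis runs only now, which is the extra overhead.
    healthy = [(p, out) for p, out in ((p1, out1), (p2, out2)) if p.self_diagnose()]
    if len(healthy) != 1:
        raise RuntimeError("self-diagnosis could not isolate one faulty processor")
    survivor, output = healthy[0]
    faulty = p2 if survivor is p1 else p1
    return output, faulty.name

def square(x):
    return x * x

def broken_square(x):
    # Hypothetical permanent fault affecting inputs of 2 and above.
    return x * x + 1 if x >= 2 else x * x

result = duplex_step(Processor("P1", square), Processor("P2", broken_square), 3)
print(result)
```

The diagnostic cost is paid only after a mismatch, in contrast to the fail-stop variant, where self-checking runs on every step.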


7.5.2 Dynamic Redundancy Approaches

Forward recovery schemes based on dynamic redundancy and checkpointing try to avoid rollback even in the presence of failures. The fault is thus tolerated without the performance penalty of a rollback.

E.g., consider a duplex system that detects failures by periodically checkpointing its two modules and then comparing their states.

When a failure is detected, the roll-forward checkpointing scheme tries to determine which of the two processing modules, if any, is fault-free.
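The identification step can be sketched as a retry on a spare module. This is a simplified model under stated assumptions: state is a single integer, `run_interval` is a hypothetical stand-in for one checkpoint interval of work, and faults are injected as an additive corruption. In the scheme itself the two modules keep executing while the spare retries; that concurrency is not modeled here.

```python
def run_interval(state, interval, corruption=0):
    """One checkpoint interval of work; a nonzero corruption models a fault."""
    return state + interval + corruption

def concurrent_retry(checkpoint, interval, state_a, state_b, spare_corruption=0):
    """Re-run the mismatched interval on the spare from the last good checkpoint.

    Returns (action, faulty module or None, state to continue from).
    """
    spare_state = run_interval(checkpoint, interval, spare_corruption)
    if spare_state == state_a:
        return "roll-forward", "B", state_a  # A was fault-free; B copies A's state
    if spare_state == state_b:
        return "roll-forward", "A", state_b  # B was fault-free; A copies B's state
    return "rollback", None, checkpoint      # no fault-free module could be identified

checkpoint = 10
state_a = run_interval(checkpoint, 1)                   # fault-free
state_b = run_interval(checkpoint, 1, corruption=1000)  # faulty

single_fault = concurrent_retry(checkpoint, 1, state_a, state_b)
double_fault = concurrent_retry(checkpoint, 1, state_a, state_b, spare_corruption=7)
print(single_fault, double_fault)
```

With a single fault the computation continues from the fault-free module's state; rollback is needed only when the spare agrees with neither module.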



Figure: Concurrent retry in the Roll-Forward Checkpointing Scheme (RFCS).



Recovery Strategy                        Resources Used
                                         With Spare          No Spare
Optimistic (only single faults)          Roll-forward (I)    Roll-forward (I)
                                                             Rollback (I)*
Pessimistic (double faults may occur)    Roll-forward (II)   Rollback (II)

Table: Three different recovery schemes (* no built-in fault detection capability included).

Variations of the RFCS may assume that each module has a built-in fault detection capability, such as parity checks or exception detection. Thus, 4 different scenarios can be conceptualized:



Figure: Optimistic scheme with or without spare, Roll-forward (I) (modules A and B over checkpoint intervals I1 and I2).

In an optimistic recovery strategy, one trusts the built-in detection capability to the fullest extent. This scheme will not require the use of a spare, even though it may be available.



Pessimistic schemes. In the pessimistic recovery strategy, although module B is already suspected to be faulty, a more conservative action is taken, just in case A experienced a failure during I1 that escaped the built-in detection capability.

Pessimistic Scheme with spare rolling forward with all single faults.

Pessimistic Scheme with spare rolling back with double faults.
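The difference between the two strategies can be sketched as two decision functions. The function names, the integer states, and the `flagged` argument (the module reported by built-in detection) are hypothetical; the sketch assumes exactly one of the two modules has been flagged.

```python
def optimistic_recover(state_a, state_b, flagged):
    """Trust built-in detection fully: the unflagged module's state is accepted."""
    if flagged not in ("A", "B"):
        raise ValueError("exactly one module must be flagged")
    trusted = state_a if flagged == "B" else state_b
    return "roll-forward", trusted

def pessimistic_recover(state_a, state_b, flagged, spare_state, checkpoint):
    """Accept the unflagged module only if the spare's retry confirms its state."""
    if flagged not in ("A", "B"):
        raise ValueError("exactly one module must be flagged")
    trusted = state_a if flagged == "B" else state_b
    if spare_state == trusted:
        return "roll-forward", trusted  # single fault: no rollback needed
    return "rollback", checkpoint       # double fault: restart from the checkpoint

checkpoint, correct = 10, 11

# Single fault: B is flagged, A is correct.
single_opt = optimistic_recover(correct, 1011, "B")
single_pes = pessimistic_recover(correct, 1011, "B", correct, checkpoint)

# Double fault: B is flagged, and A holds an error that escaped detection.
double_opt = optimistic_recover(12, 1011, "B")
double_pes = pessimistic_recover(12, 1011, "B", correct, checkpoint)
print(single_opt, single_pes, double_opt, double_pes)
```

In the double-fault case the optimistic function carries on with A's erroneous state, while the pessimistic one pays for a rollback. This is the reliability versus performance trade-off between the two strategies.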



Figure: Three different roll-forward schemes (curves 1, 2 and 3; axes: performance and reliability).

The ideal curve 1 is preferred because it allows a small reduction in reliability to be traded off against a large gain in performance. (This is the case of Optimistic Recovery Strategies).



Generally, the mean completion time given a failure has occurred is lower for the roll-forward scheme for both optimistic and pessimistic strategies.

Without any failure, all the schemes perform similarly.

When there is no built-in detection capability, the pessimistic and the corresponding optimistic scheme have identical reliabilities. Since there is no built-in detection, there is no way to identify the faulty module without comparison between operating modules and the spare one.

When there is 100% fault detection, the schemes with and without a spare have identical reliabilities.



Note:

λ = failure rate;

c = detection coverage (indicates the degree of built-in detection capabilities);

n = number of checkpoint intervals.



Figure: Performance comparison between optimistic and pessimistic schemes: mean completion time, given a fault. (Optimistic scheme is better.)

Figure: Reliability comparison between optimistic and pessimistic schemes. (Pessimistic scheme is better.)

Plot labels: Rollback, Optimistic, Roll-forward, Pessimistic.



One of the important advantages of a roll-forward scheme is the minimal degradation in I/O performance:

Rollback scheme: all outputs after I1 experience a delay of one checkpoint interval.

Roll-forward scheme: the outputs x and y are the only ones delayed; all other outputs occur at their regularly scheduled intervals.

Figure: Permanent delay in rollback scheme outputs in the event of a fault.

Figure: Temporary delay in roll-forward scheme outputs in the event of a fault.

Figure labels: modules A and B, Spare Activated, Spare Release; checkpoint intervals I1 to I6; x, y, z, w, v: system outputs.



Forward Recovery Using Checkpointing.


7.5.3 Software Redundancy-Based Approach for Forward Error Recovery

The previous approaches primarily require HW redundancy (+300%).

This approach requires a certain degree of SW redundancy, as well as HW redundancy:

SW redundancy is implemented by using Recovery Blocks. Recovery blocks are a language construct that supports the incorporation of program redundancy into a fault-tolerant program in a concise and easily readable form.



The syntax of the recovery block is:

Ensure T by B1 else by B2 ... else by Bn else error

Where: T is the acceptance test; B1 denotes the primary try block; Bk (2 ≤ k ≤ n) denotes the (k – 1)th alternate try block.
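The construct can be approximated in an ordinary language as a loop over try blocks. This is a minimal sketch rather than a real language feature: `recovery_block`, the two sort routines, and the acceptance test are hypothetical names, and the saved state is modeled by a deep copy handed to each try block.

```python
import copy

def recovery_block(acceptance_test, try_blocks, state):
    """Ensure T by B1 else by B2 ... else by Bn else error."""
    for block in try_blocks:
        trial = copy.deepcopy(state)  # each try block starts from the saved state
        try:
            result = block(trial)
        except Exception:
            continue                  # a crash counts as a failed try
        if acceptance_test(result):
            return result
    raise RuntimeError("error: no try block passed the acceptance test")

def fast_but_buggy_sort(items):
    # Hypothetical primary try block B1 with a deliberate bug.
    items.sort()
    return items[:-1]                 # drops the largest element

def simple_sort(items):
    # Hypothetical alternate try block B2.
    return sorted(items)

data = [3, 1, 2]

def sorted_and_complete(result):
    # Acceptance test T: same length as the input and in non-decreasing order.
    return len(result) == len(data) and all(a <= b for a, b in zip(result, result[1:]))

result = recovery_block(sorted_and_complete, [fast_but_buggy_sort, simple_sort], data)
print(result)
```

The primary block fails the acceptance test, so the alternate runs on an untouched copy of the input and its result is accepted. The quality of the acceptance test bounds what the construct can catch.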




Distributed Recovery Block.
