Hajvery University
Page 1: Fault Tolerance System

Hajvery University

Page 2: Fault Tolerance System

Inaam Ilahi

Furqan Farooq

Sabuha Sarwar

Ehsan Ilahi

Group Members

Page 3: Fault Tolerance System

An Introduction to

Fault Tolerant System

In Software Reliabilities

Page 4: Fault Tolerance System

Faults, Errors and Failures

A fault is a defect within the system.

An error is an observed deviation from the expected behavior of the system.

A failure occurs when the system can no longer perform as required (it does not meet its specification).

Fault tolerance is the ability of a system to provide a service even in the presence of errors.

Fault → Error → Failure

Page 5: Fault Tolerance System

A fault tolerant system is a system that is able to continue operating despite the failure of a limited subset of its hardware or software.

Such systems are gracefully degradable, i.e. as the size of the faulty set increases, the system will not collapse suddenly but will continue executing part of its workload.

The goal of this design is to ensure that the probability of system failure is acceptably small.

Fault Tolerant System

Page 6: Fault Tolerance System

Fault Types

Hardware Fault: A hardware fault is some physical defect that can cause a component to malfunction, e.g. a broken wire or the output of a logic gate that is perpetually stuck at some logic value (0 or 1).

Software Fault: A software fault is a bug that can cause the program to fail for a given set of inputs.

Page 7: Fault Tolerance System

Type of failure: Description

Crash failure: A server halts, but is working correctly until it halts

Omission failure: A server fails to respond to incoming requests
· Receive omission: A server fails to receive incoming messages
· Send omission: A server fails to send messages

Timing failure: A server's response lies outside the specified time interval

Response failure: The server's response is incorrect
· Value failure: The value of the response is wrong
· State transition failure: The server deviates from the correct flow of control

Arbitrary failure: A server may produce arbitrary responses at arbitrary times

Types Of Failure
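As a rough illustration of how a client might tell some of these failure types apart, here is a minimal sketch. The names `Failure` and `classify`, and the idea of comparing against a known expected value, are assumptions made for the example and are not part of the slides. A crash looks like an omission from the client's side, and arbitrary failures cannot be classified reliably this way.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    NONE = "no failure"
    OMISSION = "omission failure"
    TIMING = "timing failure"
    RESPONSE = "response failure"


def classify(reply: Optional[int], elapsed_s: float,
             expected: int, deadline_s: float) -> Failure:
    """Classify one request/response exchange as seen by the client.

    reply is None when no answer arrived at all (hypothetical convention).
    """
    if reply is None:
        # No response: an omission (a crashed server looks the same).
        return Failure.OMISSION
    if elapsed_s > deadline_s:
        # The answer arrived, but outside the specified time interval.
        return Failure.TIMING
    if reply != expected:
        # The answer arrived in time, but its value is wrong.
        return Failure.RESPONSE
    return Failure.NONE
```

In practice the expected value is rarely known in advance, so value failures are usually detected with acceptance tests or by comparing replicas.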

Page 8: Fault Tolerance System

Objectives Of Fault Tolerance

Availability: the system is always ready for use, or the probability that the system is ready or available at a given time

Reliability: the property that a system can run without failure for a given time

Safety: indicates the safety issues that arise if the system fails

Maintainability: refers to the ease of repairing a failed system
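Availability and reliability can be made concrete with two textbook formulas. The sketch below assumes steady-state availability computed as MTTF / (MTTF + MTTR) and a constant failure rate (exponential model) for reliability; the slides do not state which model they have in mind, and the function names are chosen only for this example.

```python
import math


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is ready.

    mttf_hours: mean time to failure, mttr_hours: mean time to repair.
    """
    return mttf_hours / (mttf_hours + mttr_hours)


def reliability(failure_rate_per_hour: float, t_hours: float) -> float:
    """Probability of running without failure for t_hours.

    Assumes a constant failure rate (exponential model).
    """
    return math.exp(-failure_rate_per_hour * t_hours)


# A system that fails on average every 999 hours and takes 1 hour to repair
# is available 99.9% of the time.
print(availability(999, 1))
```

Note that the two measures differ: a system that fails often but is repaired quickly can have high availability and low reliability.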

Page 9: Fault Tolerance System

Miscommunication

Changing Requirements

Poorly Documented Code

Lack Of Skilled Testing

Causes For Faults

Page 10: Fault Tolerance System

Miscommunication

Success of any software application depends on communication between stakeholders, development and testing teams.

Page 11: Fault Tolerance System

Changing Requirements

The customer may not understand the effects of changes, or may understand and request them anyway.

Page 12: Fault Tolerance System

Poorly Documented Code

It’s tough to maintain and modify code that is badly written or poorly documented.

Page 13: Fault Tolerance System

Lack of Skilled Testing

No tester would want to admit it, but let’s face it: poor testing does take place across organizations. There can be shortcomings in the testing process that is followed.

Page 14: Fault Tolerance System

Fault And Error Containment

The process of preventing an error from spreading from one part of the system to another is called containment.

When a fault or error occurs in one part of a system, it can spread through the system like an infectious disease, e.g. a fault in one part of the system might cause large voltage swings in another.

A fault-free processor can give erroneous results when getting input from a faulty unit.
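One common way to contain errors in software is to check every value that crosses a component boundary before using it. The following is a minimal sketch under assumed conditions: the sensor range `VALID_RANGE` and the function name `contain` are hypothetical, chosen only to show a fault-free consumer refusing implausible input from a faulty unit.

```python
import math
from typing import Optional

# Hypothetical plausible range for a temperature sensor, in degrees Celsius.
VALID_RANGE = (-40.0, 125.0)


def contain(reading: float, last_good: Optional[float]) -> Optional[float]:
    """Return the reading if it is plausible, otherwise the last good value.

    The check sits at the boundary between the (possibly faulty) sensor
    and the consumer, so a bad value is stopped before it spreads.
    """
    low, high = VALID_RANGE
    if math.isnan(reading) or not (low <= reading <= high):
        return last_good
    return reading
```

Falling back to the last good value is only one possible policy; a real system might instead raise an alarm or switch to a redundant sensor.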

Page 15: Fault Tolerance System

Recovery

Once a failure has occurred, in many cases it is important to recover critical processes to a known state in order to resume processing.

The problem is compounded in distributed systems.

Two approaches:

Backward recovery: use checkpointing (a global snapshot of the distributed system's status) to record the system state, so that the system can return to it. Checkpointing is costly (performance degradation).

Forward recovery: attempt to bring the system to a new stable state from which it is possible to proceed (applied in situations where the nature of the errors is known and a reset can be applied).
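Backward recovery can be sketched in a single process as "save the state, try the step, restore on error". The class and function names below are made up for this example, and a real distributed checkpoint (a consistent global snapshot) is considerably harder than this local version.

```python
import copy


class CheckpointedCounter:
    """Toy component with an explicit checkpoint/rollback interface."""

    def __init__(self):
        self.state = {"processed": 0, "total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Record the current state so it can be restored later.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: return to the last recorded state.
        self.state = copy.deepcopy(self._checkpoint)

    def process(self, value):
        self.state["processed"] += 1  # state changes before the error shows
        if value < 0:
            raise ValueError("negative input")
        self.state["total"] += value


def run(values):
    counter = CheckpointedCounter()
    for v in values:
        counter.checkpoint()
        try:
            counter.process(v)
        except ValueError:
            counter.rollback()  # discard the partially applied step
    return counter.state


print(run([5, -1, 7]))  # {'processed': 2, 'total': 12}
```

Taking a checkpoint before every step, as done here, shows where the performance cost comes from: each checkpoint copies the whole state.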

Page 16: Fault Tolerance System

Fault Tolerance in Distributed Systems

System attributes:
· Availability – system always ready for use, or probability that system is ready or available at a given time
· Reliability – property that a system can run without failure, for a given time
· Safety – indicates the safety issues in the case the system fails
· Maintainability – refers to the ease of repair to a failed system

Failure in a distributed system = when a service cannot be fully provided

System failure may be partial

A single failure may affect other parts of a system (failure escalation)

Page 17: Fault Tolerance System

Lost Request Messages when Server Crashes

A server in client-server communication:
• Normal case
• Crash after execution
• Crash before execution
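From the client's point of view, a crash before execution and a crash after execution look the same: no reply arrives. One common remedy, sketched below, is to retry with a request identifier so that the server executes each request at most once. The `Server` class, the `crash` parameter used to inject failures, and the assumption that the table of completed requests survives a crash are all hypothetical.

```python
class Server:
    def __init__(self):
        self.balance = 0
        # request id -> reply already computed
        # (assumed to be kept on stable storage, so it survives a crash)
        self._done = {}

    def handle(self, request_id, amount, crash=None):
        if crash == "before":
            raise ConnectionError("crashed before execution")
        if request_id not in self._done:
            self.balance += amount  # execute the request exactly once
            self._done[request_id] = self.balance
        if crash == "after":
            raise ConnectionError("crashed after execution, reply lost")
        return self._done[request_id]


def call_with_retry(server, request_id, amount, crashes):
    """Retry until a reply arrives.

    crashes lists the failure injected on each successive attempt;
    the final attempt is always failure-free.
    """
    for crash in list(crashes) + [None]:
        try:
            return server.handle(request_id, amount, crash)
        except ConnectionError:
            continue  # the client cannot tell which case occurred


server = Server()
print(call_with_retry(server, "r1", 10, ["after"]))   # 10, applied once
print(call_with_retry(server, "r2", 5, ["before"]))   # 15
```

Without the request identifier, the retry after a "crash after execution" would apply the operation twice.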

Page 18: Fault Tolerance System

REDUNDANCY

An FTS consists of properly managed redundancy, i.e. the system is to be kept running despite the failure of some of its parts. It must have spare capacity to begin with.

TYPES OF REDUNDANCY

Hardware redundancy
Software redundancy
Time redundancy
Information redundancy
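Two of these types are easy to show in a few lines. The sketch below uses an even parity bit as a simple instance of information redundancy and "compute twice and compare" as a simple instance of time redundancy. The function names are illustrative, and real systems typically use stronger codes (for example CRCs or error-correcting codes).

```python
def add_parity(bits):
    """Information redundancy: append an even-parity bit to a list of 0/1."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True if the word (data bits plus parity bit) has even parity."""
    return sum(word) % 2 == 0


def run_twice(compute, x):
    """Time redundancy: repeat the computation and compare the results.

    A mismatch suggests a transient fault; a permanent fault that affects
    both runs identically would go undetected.
    """
    first, second = compute(x), compute(x)
    if first != second:
        raise RuntimeError("results disagree; transient fault suspected")
    return first


word = add_parity([1, 0, 1, 1])
corrupted = word.copy()
corrupted[0] ^= 1  # flip a single bit
print(parity_ok(word), parity_ok(corrupted))  # True False
```

A single parity bit detects any odd number of flipped bits but cannot locate or correct them.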

Page 19: Fault Tolerance System

Software Redundancy

Software faults are not like hardware faults: software never wears out, so the faults are not generated spontaneously during system operation.

Software faults can be regarded as faults in design.

For software redundancy, simply replicating the same software N times will not work: all N copies will fail for the same inputs.

Instead N versions of the software can be implemented. The N versions can be developed by independent teams, with no contact between them.

Page 20: Fault Tolerance System

Each version is developed by a team of developers who never communicate with each other.

To minimize common-mode failures:

The specifications should be written in formal terms and subjected to a rigorous process of checking.

Multiple software versions should be developed in different programming languages.

The nature of the tools used should be chosen carefully.

The training and quality of the programmers should be maintained.

Page 21: Fault Tolerance System

There are two approaches to software redundancy:

N - Version Programming

Recovery Block Approach

Page 22: Fault Tolerance System

N - Version Programming
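In N-version programming, all versions run on the same input and a voter picks the majority result. The sketch below is an illustration only: the three "versions" are tiny functions written for this example (one of them with a deliberately seeded fault), whereas real versions would be developed by independent teams.

```python
from collections import Counter


def sum_formula(n):
    return n * (n + 1) // 2


def sum_loop(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_faulty(n):
    # Deliberately seeded design fault: stops one term too early.
    return sum(range(1, n))


def vote(results):
    """Return the value produced by a strict majority of versions."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among versions")
    return value


def n_version(n, versions=(sum_formula, sum_loop, sum_faulty)):
    # Run every version on the same input, then vote on the outputs.
    return vote([version(n) for version in versions])


print(n_version(10))  # 55: the faulty version (45) is outvoted
```

The voter masks the fault only as long as a majority of versions agree on the correct result, which is why common-mode failures across versions are the main concern.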

Page 23: Fault Tolerance System

Recovery Block Approach
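In the recovery block approach, a primary routine runs first and its result is checked by an acceptance test; if the test fails, an alternate routine is tried from the same starting state. The sketch below assumes a square-root task with a deliberately faulty primary; the routine names and the tolerance in the acceptance test are choices made for this example.

```python
import math


def acceptance_test(x, result):
    """Accept a non-negative result whose square is close to x."""
    return result >= 0 and abs(result * result - x) <= 1e-6 * max(1.0, x)


def primary(x):
    # Seeded fault for illustration: only correct for x == 0 and x == 4.
    return x / 2


def alternate(x):
    return math.sqrt(x)


def recovery_block(x, routines=(primary, alternate)):
    """Try each routine in order until one passes the acceptance test."""
    for routine in routines:
        try:
            # Each routine starts from the same saved input x.
            result = routine(x)
        except Exception:
            continue  # a crash in one routine counts as a failed attempt
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


print(recovery_block(9.0))  # 3.0: primary gives 4.5 and is rejected
```

Unlike N-version programming, the alternates run one after another rather than in parallel, so the extra cost is paid only when the primary fails its test.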

Page 24: Fault Tolerance System

Applications Of Fault Tolerance System

1. Long-life applications:
· e.g. space, satellites
· typical requirement: Availability (10 years) ≥ 0.95
· outages in between are allowed

2. Critical-computation applications:
· e.g. critical to human safety: aircraft control system
· typical requirement: Reliability (3 years) ≥ 0.9₇ (short for 0.9999999)

3. Maintenance-postponement applications:
· when maintenance operations are extremely costly
· e.g. space systems and remote processing systems, like telephone switching systems (e.g. maintenance only once a month)

Page 25: Fault Tolerance System

4. High availability applications:
· e.g. banking, flight reservation

5. Transportation systems
– train/subway
– ships
– automobiles
  • ABS anti-lock brakes
  • ESP electronic stability program
  • airbag activation
  • electronic ignition/fuel pump

Page 26: Fault Tolerance System

STEPS TO PREVENT FAILURE

Power failure
Power surge
Data loss
Device or computer failure
Unauthorized access
Overload
Virus

Page 27: Fault Tolerance System

Disadvantages

Interference with fault detection in the same component
Interference with fault detection in another component
Reduction of priority of fault correction
Test difficulty
Cost
Inferior components

Page 28: Fault Tolerance System

Conclusion

Hardware, software and networks cannot be totally free from failures.

Fault tolerance is a non-functional requirement that requires a system to continue to operate, even in the presence of faults.

Distributed systems can be more fault tolerant than centralized systems.

Agreement in faulty systems and reliable group communication are important problems in distributed systems.

Replication of data is a major fault tolerance method in distributed systems.

Recovery is another property to consider in faulty distributed environments.

Page 29: Fault Tolerance System

Thank You!