COMP-667 - Fundamental Concepts - © 2011 Jörg Kienzle COMP-667 Software Fault Tolerance Fundamental Concepts Jörg Kienzle Software Engineering Laboratory School of Computer Science McGill University

Apr 20, 2018


COMP-667

Software Fault Tolerance
Fundamental Concepts

Jörg Kienzle
Software Engineering Laboratory

School of Computer Science
McGill University


Overview (Pullum Chapter 1 / Kienzle 1.4)

• Motivation for Fault Tolerance
• Terminology
  • Faults, Errors and Failures
  • Dependability
• Recovery
  • Backward and forward
• Redundancy
• Error Confinement


Motivation (1)

• Scope, complexity and pervasiveness of computer-based and controlled systems continue to increase
• Software assumes more and more responsibility
• Consequences of systems failing
  • Annoying to catastrophic
  • Opportunities lost, businesses failed, security breaches, systems destroyed, lives lost


Examples of Software Failures (1)

On June 4, 1996, an Ariane V rocket launched by the European Space Agency exploded just forty seconds after lift-off


Ariane V Architecture

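[Figure: block diagram. Sensors feed the inertial reference systems (IRS1, IRS2), which feed the on-board computers (OBC1, OBC2), which drive the Engines. The duplicated units run in "hot standby".]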


COMP-667 Software Fault Tolerance - Course Overview - © 2009 Jörg Kienzle

Ariane V Launch, June 4th 1996

• IRS raises an Operand Error exception while converting a 64-bit float to a 16-bit integer
  • No specific exception handler
  • Operand Error caused by high value of Horizontal Bias, which is normal for Ariane V
  • Function serves no purpose after lift-off in Ariane V
  • Ariane IV, from which the code was reused, needs it during 50 seconds
• Not possible to switch to backup IRS, for it had failed as well (72 ms earlier)
• On-board Computer interprets "core dump" data as normal flight data
  • Full nozzle deflection of solid boosters and Vulcain engine
  • Angle of attack > 20°
  • Separation of boosters from main stage
• Self-destruction after 39 seconds
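The conversion problem above can be illustrated with a minimal Python sketch. It is not the flight code: the names `OperandError`, `to_int16_checked` and `to_int16_saturating` are hypothetical, and a signed 16-bit target range is assumed.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # assumed signed 16-bit target range


class OperandError(Exception):
    """Signals that a value does not fit into a signed 16-bit integer."""


def to_int16_checked(value: float) -> int:
    """Convert, raising OperandError when the value is out of range."""
    number = int(value)  # truncates toward zero
    if not INT16_MIN <= number <= INT16_MAX:
        raise OperandError(f"{value!r} does not fit into 16 bits")
    return number


def to_int16_saturating(value: float) -> int:
    """Convert, clamping out-of-range values instead of raising."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


def convert_with_handler(value: float) -> int:
    """A specific exception handler: fall back to the clamped value."""
    try:
        return to_int16_checked(value)
    except OperandError:
        return to_int16_saturating(value)
```

Whether clamping is an acceptable reaction depends on the application; the point is only that the out-of-range case is handled explicitly instead of propagating as an unhandled exception.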


Examples of Software Failures (2)

• Aerospace
  • Denver airport: failure in luggage management system ⇒ opening delayed for several months
  • Failure of a space probe sent to Mars due to a mismatch of measuring units (imperial vs. metric)
  • Launch of Atlantis delayed 3 days
  • Problems when space shuttle Endeavour met with Intelsat 6 due to rounding of near-zero values
  • Flaw in Apollo 11 software made the moon's gravity repulsive rather than attractive


Examples of Software Failures (3)

• AT&T system suffered a 9-hour US-wide blockade
  • Switch experienced abnormal behavior ⇒ due to flaws in recovery recognition software and network design, effects propagated to all switches

• Software problem caused radiation safety door of a nuclear power processing plant in the UK to open accidentally

• Several patients killed through radiation overdoses due to software flaws in Therac-25 (cancer treatment system)


Motivation (2)

• Considerable progress in software engineering
  • Analysis
  • Design
  • Testing
  • Formal methods
  • CASE tools
• Experience shows that we still cannot assume that the produced software is fault free


Terminology

• Failure
  • Observable deviation from the specification
• Error
  • Part of the system state that leads to a failure
  • Latent errors [Lap85]
• Fault
  • "Defect" or "Flaw" of a system
  • Bug


Causal Relationship

Fault → (activation) → Error → (propagation) → Failure

(Failure ⇒ Error ⇒ Fault)

• Hierarchical model
  • Failure at one level can be seen as a fault at a higher level


Goal of Fault Tolerance

The Goal of Fault Tolerance is to Avoid System Failure in the Presence of Faults


Fault Tolerance

• Continue to provide service in the presence of faults of underlying components or the environment

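[Figure: timeline along a Time axis showing Fault, Error, Failure and Consequence, divided into an Internal (System / Component / Object) part and an External part, with intervals labeled Latency and Inertia and a region labeled Fault Tolerance.]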


Origin of Faults

[Figure: origins of faults: Process Management, Requirements Engineering, Specification, Design, Implementation, Documentation, Reuse, Component Defects, Environment, Human Interaction, Operation, Maintenance.]


Fault Classification

• Temporal Occurrence
  • Transient fault
  • Intermittent fault (periodic fault)
  • Permanent fault
• Creation time
  • Design fault
  • Operational fault
• Intention
  • Accidental fault
  • Intentional fault


Failure Semantics

• Crash failure
  • Fail-silent and Fail-stop
• Omission failure
• Timing failure
  • System fails to respond within a specified time slice
  • Both late and early responses might be "bad"
  • Also called performance failure
• Byzantine failure
  • System behaves arbitrarily
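As a small illustration of the omission and timing cases, the sketch below classifies one observed response against a specified time window. The function name, the window parameters and the returned labels are assumptions made for this example only.

```python
from typing import Optional


def classify_response(elapsed_s: Optional[float],
                      earliest_s: float,
                      latest_s: float) -> str:
    """Classify one response against the time window [earliest_s, latest_s].

    elapsed_s is None when no response was observed at all.
    """
    if elapsed_s is None:
        return "omission failure"
    if elapsed_s < earliest_s:
        return "timing failure (early)"
    if elapsed_s > latest_s:
        return "timing failure (late)"
    return "ok"
```

Crash and Byzantine failures cannot be recognized from a single response time, so they are deliberately left out of this sketch.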


Failure Hierarchy

[Figure: nested failure classes labeled Byzantine, Timing, Omission, Crash and Fail Stop.]

The algorithms used for achieving any kind of fault tolerance depend on the computational model


Reliable Software Development

[Figure: four groups of techniques]

• Fault Avoidance: Rigorous Software Development, Model-Driven Architecture, Formal Methods, Software Reuse, Clear Documentation
• Fault Forecasting: Statistical Inference, Experience, Quality Estimation, Reliability Measurement
• Fault Removal: Verification, Validation, Testing, Formal Inspection
• Fault Tolerance


Fault Avoidance / Prevention

• Reduce the number of faults during software construction
• Rigorous Software Development Process
  • Requirements Specification & Analysis
  • Structured Design
  • Well-defined mapping to Programming Languages
• Clear Documentation
• Formal Methods
• Software Reuse


Rigorous Software Development (1)

• Requirements elicitation
  • Discover what features each stakeholder expects the system to provide
  • Imperfect process
  • Technical and non-technical people have to collaborate
    • Use-cases
  • Computer scientists can't be experts in all application areas


Rigorous Software Development (2)

• Analysis / Specification
  • Specify in a clear and precise way what functionality your system must provide
  • Complete, but not too complex
  • Consistent
  • Determine (or even better: generate) test cases


Drink Distributor Example (1)

• Provides hot drinks: coffee, tea and chocolate
• User interface
• Cycle treatment
  1. Insert money
  2. Choose drink
  3. Take change
  4. Take drink
  • Or press cancel ⇒ coins are given back

[Figure: Drink Distributor front panel with a Coins slot, a Cancel button, Coffee / Tea / Chocolate selection buttons, and Change and Drink outlets.]


Drink Distributor Example (2)

• Incomplete specification
  • No deadline for cancellation specified
  • What if user inserts new coins before the end of a cycle?
  • What if the user changes his selection?
  • What should be done when resources (change, cups, spoons, sugar, coffee, tea, chocolate, water) run out?
    • Provide partial service? (e.g. only tea and coffee / require exact change)
• If manufacturer and user make divergent interpretations, operation-time failures will occur


Drink Distributor Example (3)

• Augment specification
  • Cancellation not possible once drink has been chosen
  • Add green / red light to indicate cycle start
  • Only the first selected beverage is taken into account
  • Add lights to show availability of drinks
• Each omission of a constraint in the specification can lead to a failure in the service delivered to the user
  • Dissatisfaction
  • Loss of money
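One possible reading of the augmented specification can be written down as a small state machine. This is a sketch only: the class name, the single fixed price, and the decision to reject coins inserted after a drink has been chosen are assumptions that the slides do not settle.

```python
class DrinkDistributor:
    PRICE = 100  # assumed single price, in cents

    def __init__(self, stock):
        self.stock = dict(stock)   # drink name -> remaining portions
        self.credit = 0
        self.selection = None      # set once a drink has been chosen

    def available(self, drink):
        """Drives the availability light of one drink."""
        return self.stock.get(drink, 0) > 0

    def insert(self, cents):
        if self.selection is not None:
            raise RuntimeError("cycle in progress, coins rejected")
        self.credit += cents

    def choose(self, drink):
        if self.selection is not None:
            return self.selection  # only the first selection counts
        if not self.available(drink) or self.credit < self.PRICE:
            return None
        self.selection = drink
        return drink

    def cancel(self):
        if self.selection is not None:
            return 0               # too late: a drink has been chosen
        refund, self.credit = self.credit, 0
        return refund

    def deliver(self):
        if self.selection is None:
            raise RuntimeError("no drink has been chosen")
        drink, change = self.selection, self.credit - self.PRICE
        self.stock[drink] -= 1
        self.credit, self.selection = 0, None
        return drink, change
```

Each guard in the code corresponds to one constraint added to the specification; removing a guard reintroduces one of the ambiguities listed on the previous slide.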


Rigorous Software Development (3)

• Structured design
  • For instance in Object-Orientation: apply O-O principles, e.g. abstraction, information hiding, modularity, classification, to reduce complexity of the solution
  • Assign responsibilities to objects
  • Provide easy-to-read documentation
    • UML


Rigorous Software Development (4)

• Programming Methodology
  • Good programming discipline
    • Pair-programming
  • Well-defined mapping of design models to programming constructs
  • Standards or coding conventions


Formal Methods (1)

• Specifications are developed using mathematically tractable languages and tools
  • Petri Nets, Algebraic Specifications
• Allows proving of desired properties
  • Verification and validation
• Generation of test cases
• Generation of code!


Formal Methods (2)

• Mathematical specifications of software tend to be about the same size as the program itself ⇒ just as error-prone

• Tools (model-checkers) still face algorithmic challenges when attempting to prove properties of huge models

• Have been successfully applied for “small”, safety-critical components

• Domain-specific modeling!


Software Reuse

• Well exercised software is less likely to fail
• Save development cost
• Undiscovered faults may appear when the component is used in a new environment


Fault Removal

• Detect and remove existing faults by verification and validation

• Testing
  • Exhaustive testing not feasible
  • Can't show the absence of faults
  • Quality measures
• Formal Inspection
• Formal Design Proofs


Fault Forecasting

• Also known as software reliability measurement [Lyu96]
• Estimation
  • Gather failure data during operation or testing
  • Apply statistical inference techniques
• Prediction
  • Gather software metrics during development
• Fault forecasting can indicate the need for additional testing or for applying fault tolerance
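A very simple instance of the estimation step is sketched below. It assumes a constant failure rate (exponential model), which is only one of many reliability models and is not prescribed by the slides; the function names and the sample data are made up for illustration.

```python
import math


def estimate_failure_rate(inter_failure_hours):
    """Maximum-likelihood failure rate under a constant-rate assumption:
    number of observed failures divided by total observed time."""
    total = sum(inter_failure_hours)
    if not inter_failure_hours or total <= 0:
        raise ValueError("need at least one positive observation")
    return len(inter_failure_hours) / total


def reliability(rate, mission_hours):
    """Probability of surviving mission_hours without failure: exp(-rate * t)."""
    return math.exp(-rate * mission_hours)


# Hypothetical times between failures observed during testing, in hours.
observed = [100.0, 200.0, 300.0]
rate = estimate_failure_rate(observed)
mtbf = 1.0 / rate
```

If the estimated reliability for the intended mission time is too low, that is an indication for additional testing or for adding fault tolerance.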


Seriousness Classes (1)

• DO-178B, civil aeronautics
  • Without effects
  • Minor / benign
    • Upset passengers, small increase in workload for the crew
  • Major / significant
    • Injuries of the passengers / crew and reducing the efficiency of the crew
  • Dangerous / serious
    • Small number of casualties / serious injuries, or preventing the crew from achieving its task in a precise and complete manner
  • Catastrophic / disastrous
    • Leading to loss of human lives


Seriousness Classes (2)

• DO-178B, civil aeronautics
  • Without effects
  • Minor / benign
    • Probable: $p > 10^{-5}$
  • Major / significant
    • Rare: $10^{-7} < p < 10^{-5}$
  • Dangerous / serious
    • Extremely rare: $10^{-9} < p < 10^{-7}$
  • Catastrophic / disastrous
    • Extremely improbable: $p < 10^{-9}$
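Read as upper bounds, the table says that the more serious the consequence, the lower the probability that can be tolerated. The sketch below encodes that reading; the dictionary keys and the function name are assumptions, and the unit of the probability is left as on the slide.

```python
# Upper bound on the tolerated failure probability per seriousness class.
# None means the slide gives no upper bound for that class.
MAX_PROBABILITY = {
    "without effects": None,
    "minor": None,
    "major": 1e-5,
    "dangerous": 1e-7,
    "catastrophic": 1e-9,
}


def is_acceptable(seriousness: str, probability: float) -> bool:
    """True if a failure of this seriousness is rare enough."""
    limit = MAX_PROBABILITY[seriousness]
    return limit is None or probability < limit
```

For example, a failure mode estimated at 1e-8 would be acceptable for the dangerous class but not for the catastrophic one.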


Software Fault Tolerance

• Tolerate faults that remain in the system after development, preventing system failure ⇒ Remove errors and their effects from the computational state before a failure occurs

• Successfully applied in aerospace, nuclear power, healthcare, telecommunications and transportation industries

• 35 years of research


Classification

• Single Version Software
  • Monitoring techniques, atomicity of actions, decision verification, exception handling
• Multi-version Software
  • Functionally independent, yet equivalent software
  • Recovery blocks, N-version programming, …
• Multiple Data Representation
  • Retry blocks, N-copy programming, …
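The two multi-version techniques named above can be sketched in a few lines. This is a simplified illustration, not a full implementation: the recovery block omits state restoration between alternates, the voter requires hashable results, and all function names are hypothetical.

```python
import math
from collections import Counter


def recovery_block(alternates, acceptance_test, x):
    """Try alternates in order; return the first result that passes the test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # a crashing alternate counts as a rejected one
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def n_version(versions, x):
    """Run all versions and return the strict-majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a crashing version simply casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) // 2:
            return value
    raise RuntimeError("no majority among the versions")


# Example: integer square root with one deliberately faulty version.
def faulty_isqrt(n):
    return n // 2


def float_isqrt(n):
    return int(n ** 0.5)


def is_isqrt(n, r):
    return r * r <= n < (r + 1) * (r + 1)
```

The recovery block needs an acceptance test but runs only as many versions as necessary; N-version programming needs no acceptance test but always pays for running every version.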


EFTS 2006 - Exceptions and the Software Life-Cycle: Starting with Requirements

Recovery

• Error detection
  • Identify erroneous state
• Error diagnosis
  • Assess the damage
• Error containment / isolation
  • Prevent further damage / error propagation
• Error recovery
  • Substitute the erroneous state with an error-free one
  • Backward and Forward Error Recovery


Backward Error Recovery (1)

• System state is saved at predetermined recovery points
  • Called checkpointing
  • Incremental checkpointing, log
  • State should be checkpointed on stable storage, not affected by failures
• Recover error-free state by rolling back to a previously saved (error-free) state
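Checkpointing and rollback can be sketched as follows. The class and function names are hypothetical, and an in-memory deep copy stands in for stable storage, which a real system would need in order to survive crashes.

```python
import copy


class Checkpointed:
    """Holds a state plus one saved copy of it (the recovery point)."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        # Assumption: an in-memory copy replaces stable storage here.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._saved)


def run_with_rollback(store, operation, is_error_free):
    """Checkpoint, run the operation, roll back if an error is detected."""
    store.checkpoint()
    try:
        operation(store.state)
        if not is_error_free(store.state):
            raise ValueError("erroneous state detected")
    except Exception:
        store.rollback()
        return False
    return True


# Example: a withdrawal that would overdraw the account is undone.
account = Checkpointed({"balance": 100})


def overdraw(state):
    state["balance"] -= 500


def withdraw_30(state):
    state["balance"] -= 30


def non_negative(state):
    return state["balance"] >= 0
```

Note that the rollback needs no knowledge of what went wrong: it only relies on the saved state being error-free.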


Backward Error Recovery (2)

[Figure: execution timeline showing two checkpoints, the point where the fault manifests, and the point where the error is detected.]

Assumption: faulty behavior occurred after last checkpoint


COMP-667 - Fundamental Concepts - © 2009 Jörg Kienzle

Backward Error Recovery (2)

[Figure: the same timeline with two checkpoints and a Rollback.]

Assumption: faulty behavior occurred after last checkpoint


Backward Error Recovery (2)

[Figure: timeline with four checkpoints and a Rollback.]

Depending on the assumed fault and on the specific fault tolerance technique used:

• Try again
• Try a different alternate
• Do nothing (wait for the next request)


Advantages of Backward Recovery

• Requires no knowledge of the errors in the system state
• Can handle arbitrary / unpredictable faults (as long as they do not affect the recovery mechanism)
• Can be applied regardless of the sustained damage (the saved state must be error-free, though)
• General scheme / application independent
• Particularly suitable for recovering from transient faults


Disadvantages of Backward Recovery

• Requires significant resources (e.g. time, computation, stable storage) for checkpointing and recovery
• Checkpointing requires
  • Consistent states to be identified
  • The system to be halted / slowed down temporarily
• Care must be taken in concurrent systems to avoid the domino effect


Forward Error Recovery

• Detect the error
• Detailed damage assessment
• Build a new error-free state from which the system can continue execution
  • "Safe stop"
  • Degraded mode
  • Error compensation
    • E.g., switching to a different component
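Forward recovery can be sketched with a sensor reading that is replaced rather than rolled back. Everything specific here is an assumption for illustration: the function name, the plausibility range of 0 to 400, and the use of the last good value as the degraded-mode state.

```python
def read_speed(primary, backup, last_good):
    """Return (value, mode) without ever rolling back.

    primary and backup are zero-argument callables returning a reading.
    """
    for source, mode in ((primary, "normal"), (backup, "compensated")):
        try:
            value = source()
        except Exception:
            continue                   # error detected: source crashed
        if 0.0 <= value <= 400.0:      # assumed plausibility check
            return value, mode
    # Neither source is usable: build a new, safe state and continue.
    return last_good, "degraded"


def broken_sensor():
    raise IOError("sensor not responding")
```

Switching to the backup is error compensation; continuing with the last good value is a degraded mode. Both require knowing in advance which errors to expect, which is what makes forward recovery application-specific.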


Forward Error Recovery (2)

[Figure: timeline showing the fault manifesting and the error being detected, with phases labeled Error Confinement, Damage Assessment and State Reconstruction.]


Advantages of Forward Recovery

• Efficient (time / memory)
  • If the characteristics of the fault are well understood, forward recovery is the most efficient solution
• Well suited for real-time applications
  • Missed deadlines can be addressed
• Anticipated faults can be dealt with in a timely way using redundancy


Disadvantages of Forward Recovery

• Application-specific
• Can only remove predictable errors from the system state
• Requires knowledge of the actual error
• Depends on the accuracy of error detection, potential damage prediction, and actual damage assessment

• Not usable if the system state is damaged beyond recoverability


Redundancy

• Key concept of fault tolerance
• Hardware redundancy
  • Most common use of redundancy
  • We're not going to address it
• Software redundancy
  • Additional applications, modules, objects used in the system to support fault tolerance
• Information redundancy
  • Error-detecting or error-correcting codes
  • Diverse data
  • Data produced for fault tolerance
• Time redundancy
  • Use additional time for fault tolerance
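Two of these forms are easy to show in code. The sketch below uses a CRC-32 checksum from Python's standard `zlib` module as an error-detecting code (information redundancy) and a bounded retry loop (time redundancy). The function names and the 4-byte frame layout are assumptions for illustration.

```python
import zlib


def protect(payload: bytes) -> bytes:
    """Information redundancy: append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(frame: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum disagrees."""
    payload, stored = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch")
    return payload


def retry(operation, attempts=3):
    """Time redundancy: repeat an operation that may fail transiently."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:
            last_error = error
    raise RuntimeError("operation failed on every attempt") from last_error
```

The checksum only detects errors; recovering from them needs an error-correcting code or another form of redundancy, for example retrying the transfer.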


Architectural Structure

• Systems, especially concurrent ones, are increasingly complex

  • Consist of several components / subcomponents
• Fault tolerance must account for that
  • Different fault tolerance approaches for each component
  • Failure of a subcomponent can be perceived as a fault in the parent component
• Clear structuring reduces complexity


Error Confinement

• System partitioned into regions, beyond which effects of faults should not propagate

• Components should only be accessible through a well-defined (and preferably narrow [Kop97]) interface

• Different confinement regions may employ different fault tolerance techniques depending on failure semantics of the environment and subcomponents


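Idealized Fault-Tolerant Component [Lee90]

[Figure: a component divided into Normal Processing and Error Processing. It receives a Service Request and returns a Reply, and in turn sends a Service Request to and receives a Reply from other components. Interface Exception and Failure Exception arrows appear at both of these boundaries; a Local Exception arrow leads from Normal Processing to Error Processing.]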


Idealized Fault-Tolerant Component

• Receives requests for service
• Produces responses
• 3 kinds of exceptions
  • Interface exception: An invalid service request has been made
  • Local exception: An internal error is detected
  • Failure exception: Component is unable to provide the requested service
• Recursive structure
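The three kinds of exceptions can be mapped onto a small component. This is a sketch under stated assumptions: the class `SquareRootComponent`, its injectable `algorithm` and `fallback` parameters, and the result check are all made up for illustration.

```python
import math


class InterfaceException(Exception):
    """An invalid service request has been made."""


class LocalException(Exception):
    """An internal error has been detected."""


class FailureException(Exception):
    """The component is unable to provide the requested service."""


class SquareRootComponent:
    def __init__(self, algorithm, fallback=None):
        self.algorithm = algorithm
        self.fallback = fallback

    def request(self, x):
        if isinstance(x, bool) or not isinstance(x, (int, float)) or x < 0:
            raise InterfaceException("x must be a non-negative number")
        try:
            return self._normal_processing(x)
        except LocalException:
            return self._error_processing(x)

    def _check(self, x, result):
        if not math.isclose(result * result, x, rel_tol=1e-9, abs_tol=1e-12):
            raise LocalException("result is not a square root of x")
        return result

    def _normal_processing(self, x):
        return self._check(x, self.algorithm(x))

    def _error_processing(self, x):
        if self.fallback is None:
            raise FailureException("no way to recover")
        try:
            return self._check(x, self.fallback(x))
        except LocalException as error:
            raise FailureException("fallback failed as well") from error
```

A caller sees only replies, interface exceptions and failure exceptions; local exceptions never leave the component. A failure exception raised here would in turn be treated as an error by the enclosing component, which gives the recursive structure.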


Questions

• What are the four means for achieving dependability?

• What is the goal of software fault tolerance?
• Name the two error recovery strategies, and briefly explain how they work…
• What are the different forms of redundancy that can help in constructing fault-tolerant software?
• What are latency and inertia?


References

• [Lap85] Laprie, J.-C.: "Dependable Computing and Fault Tolerance: Concepts and Terminology", in Proceedings of the 15th International Symposium on Fault-Tolerant Computing Systems (FTCS-15), pp. 2–11, Ann Arbor, MI, USA, June 1985.
• [Lyu96] Lyu, M. R. (ed.): Handbook of Software Reliability Engineering, New York, IEEE Computer Society Press, McGraw-Hill, 1996.
• [Kop97] Kopetz, H.: Real-Time Systems: Design Principles for Distributed Embedded Applications. Kluwer Academic Publishers, 1997.
• [RX95] Randell, B.; Xu, J.: The Evolution of the Recovery Block Concept, chapter 1, pp. 1–21, in Lyu, M. R. (ed.): Software Fault Tolerance, John Wiley & Sons, 1995.
