Developing Dependable Systems CIS 376 Bruce R. Maxim UM-Dearborn

Developing Dependable Systems

CIS 376

Bruce R. Maxim

UM-Dearborn

Software Dependability

• Customers expect all software to be dependable.

• They may accept some system failures in non-critical applications

• Applications having high dependability requirements require special programming techniques

Achieving Dependability

• Fault avoidance

– software developed to minimize impact of human error

– development process is organized so that faults in the software are detected and repaired before customer delivery

• Fault tolerance

– software designed so that faults in delivered software do not cause system failure

Fault Minimization

• Current SE methods can produce fault-free software

• Fault-free software merely conforms to its specification (it may or may not always perform correctly since the specification may be flawed)

• Producing fault-free software is very expensive and may only be justified in exceptional situations

• It may be cheaper to accept some software faults

Developing Fault-Free Software

• Needs a precise (preferably formal) specification

• Requires an organizational commitment to quality

• Information hiding and encapsulation in software design are essential

• A programming language with strict type checking and run-time checking should be used

• Needs a dependable and repeatable development process

Error Prone Constructs - part 1

• Floating-point numbers

– inherently imprecise, frequent comparison errors

• Pointers

– dangling references and aliases possible

• Dynamic Memory Allocation

– memory overflow and garbage problems

• Parallelism

– race conditions and deadlocks are possible
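
The floating-point comparison problem listed above can be sketched as follows. This is a minimal illustration, not a prescribed technique: the helper name `nearly_equal` and its default tolerance are assumptions made for the example.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: treats two doubles as equal when they differ by no
// more than a small tolerance scaled by their magnitude (with a floor of 1.0,
// so values near zero are compared with an absolute tolerance).
// The default tolerance is an illustrative assumption.
bool nearly_equal(double a, double b, double rel_tol = 1e-9) {
    double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_tol * std::max(scale, 1.0);
}

// Adds 0.1 ten times. Because 0.1 has no exact binary representation, the
// result is typically very close to, but not exactly, 1.0.
double sum_of_tenths() {
    double sum = 0.0;
    for (int i = 0; i < 10; ++i) sum += 0.1;
    return sum;
}
```

Comparing `sum_of_tenths()` with `1.0` using `==` is the kind of comparison error the slide warns about; `nearly_equal(sum_of_tenths(), 1.0)` accepts the small rounding difference.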

Error Prone Constructs - part 2

• Recursion

– memory overflow when errors occur

• Interrupts

– errors are difficult to trace

• Inheritance

– code is no longer localized, unexpected results can arise when changes are made

Note: You can use these constructs as needed, but you must be careful to use them correctly.

Information Hiding

• Information should only be available to program components on a need-to-know basis

– reduces the probability of accidental corruption of information

– information is encapsulated to prevent error propagation to the rest of the program

– since information is localized, programmers are less likely to make errors and reviewers are more likely to find errors

Reliable Software Processes

• Having a well-defined, repeatable software process will reduce the number of software faults

• A well-defined repeatable process is one that does not depend entirely on individual skills, but can be carried out by a team

• Significant verification and validation process activities must be included to minimize the number of software faults.

Process Validation Activities

• Requirements inspections

• Requirements management

• Model checking

• Design inspections

• Code inspections

• Static code analysis

• Test planning and management

• Configuration management

Fault Tolerance

• Required in critical applications (high reliability needed and high failure costs)

• System can continue operation, despite software failure

• A system which seems to be fault-free must also be fault tolerant (in case specification errors exist or the validation is incorrect)

Fault Tolerant Actions

• Fault detection

– system determines an incorrect system state has occurred

• Damage assessment

– determine system parts affected by fault

• Fault recovery

– system must restore its state to a known safe state

• Fault repair

– for a non-transitory fault, system is modified to prevent repetition

Approaches

• Defensive Programming

– programmers assume faults exist in system code

– redundant code is written to check system state for consistency after modifications are made

• Fault Tolerant Architectures

– HW and SW architectures that support redundancy are used

– a fault tolerance controller detects problems and supports recovery

• Both approaches are important

Exception Management

• An exception could be a program error or an event such as a power failure

• Exception handling facilities in programming languages allow exceptions to be handled without constant checking to detect them

• Using normal control constructs to detect exceptions in a sequence of procedural calls adds considerable timing overhead to a program
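
The point about avoiding constant checking can be sketched in C++ as below. All function names (`parse_percentage`, `average_percentage`, `average_or_default`) are hypothetical and exist only for this illustration; the sketch assumes the failure of interest is malformed or out-of-range input.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical low-level operation: signals failure by throwing rather than
// by returning an error code that every caller would have to test.
int parse_percentage(const std::string& text) {
    int value = std::stoi(text);  // throws std::invalid_argument if text is not a number
    if (value < 0 || value > 100)
        throw std::out_of_range("percentage outside 0..100");
    return value;
}

// Intermediate layer: contains no error checks; exceptions pass through it.
int average_percentage(const std::string& a, const std::string& b) {
    return (parse_percentage(a) + parse_percentage(b)) / 2;
}

// Top-level caller: a single handler covers every failure in the call chain.
int average_or_default(const std::string& a, const std::string& b, int fallback) {
    try {
        return average_percentage(a, b);
    } catch (const std::exception&) {
        return fallback;
    }
}
```

With return codes, `average_percentage` would need a test after each call; with exceptions, only the layer that can actually respond to the problem mentions it.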

Fault Detection

• Languages with strict type checking allow many errors to be trapped during program compilation

• Some types of errors can only be caught at run-time (e.g. cin >> I; cin >> A[I];)
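
In the `cin >> I; cin >> A[I];` example, the compiler cannot know whether the value read into `I` is a valid index, so the check has to happen at run time. A minimal sketch of such a check follows; the function name `read_element` and the array size are assumptions made for the illustration.

```cpp
#include <array>
#include <cstddef>
#include <istream>
#include <sstream>

constexpr std::size_t kSize = 10;  // illustrative array size

// Reads an index and then a value from `in`, storing the value only if the
// index is valid. Returns false and leaves `a` untouched on malformed input
// or an out-of-range index.
bool read_element(std::istream& in, std::array<int, kSize>& a) {
    long long i = 0;
    int value = 0;
    if (!(in >> i >> value)) return false;                          // malformed input
    if (i < 0 || i >= static_cast<long long>(kSize)) return false;  // index out of range
    a[static_cast<std::size_t>(i)] = value;
    return true;
}
```

The function accepts any input stream, so it can be called with `std::cin` in a program or with a `std::istringstream` when testing.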

Fault Detection Approaches

• Preventative Fault Detection

– fault detection mechanism is activated before a state change is committed

– if an erroneous state is detected, the change is cancelled

• Retrospective Fault Detection

– fault detection mechanism is initiated after system state change has been made

– used when a correct sequence of actions can lead to an erroneous system state or preventative fault detection has too much overhead

Type System Extension

• Preventative fault detection really involves extending the current type system by including additional constraints as part of the type definition

• These constraints are typically implemented by defining basic operations within a class definition
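
One way to picture a constraint built into a type is a small class whose operations refuse invalid values. The class below is a hypothetical example (the name `PositiveEvenInteger` and the choice of constraint are assumptions), not a definition taken from the slides.

```cpp
#include <stdexcept>

// Hypothetical constrained type: may only hold positive even integers.
class PositiveEvenInteger {
public:
    explicit PositiveEvenInteger(int v) : value_(checked(v)) {}

    // The constraint is checked before the state change is committed; if the
    // check fails, the exception leaves the old value in place.
    void assign(int v) { value_ = checked(v); }

    int value() const { return value_; }

private:
    // Returns v unchanged if it satisfies the constraint, otherwise throws.
    static int checked(int v) {
        if (v <= 0 || v % 2 != 0)
            throw std::invalid_argument("value must be positive and even");
        return v;
    }

    int value_;
};
```

Because every change to the stored value goes through `checked`, code outside the class cannot put the object into a state that violates the constraint.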

Damage Assessment

• System is analyzed to judge the extent of corruption caused by a system failure

• Must determine what parts of the state space have been affected by the failure

• Generally based on “validity functions” which can be applied to the state elements to assess if their value is within an allowed range

Damage Assessment Techniques

• Checksums are used to check for data transmission errors

• Redundant pointers can be used to check integrity of data structures

• Watchdog timers can help check for non-terminating processes (e.g. if there is no response for a long time, assume the worst)
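
As a sketch of the checksum technique, the code below uses a deliberately simple additive checksum. The function names are hypothetical, and the algorithm is chosen for brevity; it is an assumption of this example, since real systems typically use stronger codes such as CRCs.

```cpp
#include <cstdint>
#include <string>

// Simple additive checksum: the sum of all bytes modulo 256.
// Illustrative only; it cannot detect every kind of corruption.
std::uint8_t checksum(const std::string& data) {
    unsigned sum = 0;
    for (unsigned char byte : data) sum = (sum + byte) % 256;
    return static_cast<std::uint8_t>(sum);
}

// Validity check: true if the checksum stored with the data still matches
// the checksum recomputed from the data.
bool is_intact(const std::string& data, std::uint8_t stored) {
    return checksum(data) == stored;
}
```

The sender stores or transmits `checksum(data)` alongside the data; the receiver calls `is_intact` and treats a mismatch as evidence of damage.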

Fault Recovery

• Forward Recovery

– apply repairs to corrupted system state

– usually application specific, requires domain knowledge

– e.g. error coding, such as a checksum, added to data

• Backward Recovery

– restore system to known safe state

– simpler, since archived safe state is used to replace erroneous state

– e.g. use of checkpoints in a word processor
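
Backward recovery with a checkpoint can be sketched as follows. The `Document` class is a hypothetical stand-in for an editor buffer, and keeping a single in-memory checkpoint is an assumption made to keep the example short.

```cpp
#include <string>

// Hypothetical editor buffer that keeps one checkpoint for backward recovery.
class Document {
public:
    void append(const std::string& text) { text_ += text; }

    // Archive the current state as the known safe state.
    void checkpoint() { saved_ = text_; }

    // Discard the current (possibly erroneous) state and return to the
    // archived safe state.
    void restore() { text_ = saved_; }

    const std::string& text() const { return text_; }

private:
    std::string text_;
    std::string saved_;
};
```

No knowledge of what went wrong is needed to recover: `restore` simply replaces the whole state, which is why backward recovery is the simpler of the two approaches.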

Fault Tolerant Architecture

• Defensive programming cannot cope with faults caused by HW and SW interactions

• If requirements are not understood then SW checks are not likely to be correct

• Systems with high availability requirements often require fault tolerant architectures

• Must tolerate both HW and SW failure

Hardware Fault Tolerance

• Triple-modular redundancy (TMR)

• Three replicated components are included in the system

• If one component produces different output than the other two, failure is assumed

• This idea is based on the notion that most failures result from component failures, not design faults

• Component failures should be a low probability event
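
The comparison step in TMR amounts to a majority vote over three outputs. A minimal software sketch of such a voter is shown below; the function name `vote` and its interface are assumptions for illustration, and real TMR voters are usually implemented in hardware.

```cpp
// Majority voter for three replicated components (hypothetical interface).
// Returns true and stores the majority value in `result` when at least two
// of the three outputs agree; returns false if all three disagree.
template <typename T>
bool vote(const T& a, const T& b, const T& c, T& result) {
    if (a == b || a == c) { result = a; return true; }
    if (b == c) { result = b; return true; }
    return false;
}
```

When exactly one output differs from the other two, the voter masks it and the differing component can be flagged as failed. The scheme relies on simultaneous failures being unlikely, since two components failing in the same way would outvote the correct one.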

Software Fault Tolerance

• TMR is based on two assumptions

– HW components do not include common design flaws

– simultaneous component failures are not likely

• Neither assumption is valid for software components

– isn’t possible to replicate SW components without replicating their design flaws

– simultaneous component failure is inevitable

• Software systems must be diverse

Design Diversity

• Different versions of the system are designed and implemented in different ways (so they should have different failure modes)

• Different approaches to design

– object-oriented and function oriented

– different implementation languages

– different algorithms in the implementation

– different tools or environments

Software Analogies to TMR

• N-version Programming

– same specification is implemented in a number of different versions by several teams

– all versions compute simultaneously, the majority output is presumed correct

• Recovery blocks

– a number of explicitly distinct versions of a program are written for the same specification and executed in sequence

– an acceptance test is used to select the output to keep
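
The recovery block scheme can be sketched as a loop over alternates guarded by an acceptance test. Everything named here (`recovery_block`, `faulty_sqrt`, `library_sqrt`, `sqrt_ok`) is hypothetical, and the square-root task with a deliberately faulty primary version is an assumption chosen only to make the example small.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <stdexcept>
#include <vector>

// Runs the alternates in order and returns the first result that passes the
// acceptance test. Throws if every alternate is rejected or throws.
double recovery_block(const std::vector<std::function<double(double)>>& alternates,
                      const std::function<bool(double, double)>& acceptable,
                      double input) {
    for (const auto& alternate : alternates) {
        try {
            double result = alternate(input);
            if (acceptable(input, result)) return result;
        } catch (const std::exception&) {
            // This alternate failed outright; try the next one.
        }
    }
    throw std::runtime_error("all alternates failed the acceptance test");
}

// Hypothetical primary version with a deliberate fault: halves the input.
double faulty_sqrt(double x) { return x / 2.0; }

// Hypothetical alternate version.
double library_sqrt(double x) { return std::sqrt(x); }

// Acceptance test: the result squared must reproduce the input within a
// small tolerance (the tolerance is an illustrative assumption).
bool sqrt_ok(double x, double r) {
    return std::fabs(r * r - x) <= 1e-9 * std::max(1.0, x);
}
```

Calling `recovery_block({faulty_sqrt, library_sqrt}, sqrt_ok, 9.0)` rejects the primary result and falls back to the alternate. Unlike N-version programming, only one version runs at a time, and the quality of the scheme depends heavily on the acceptance test.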

Problems with Design Diversity

• Teams tend to tackle the same problems in the same ways, so the resulting implementations may not be diverse

• Characteristic errors

– different teams are likely to make the same mistakes, since some parts of the implementation are more difficult than others

– specification errors may cause the same errors to appear in all implementations (argument for developing multiple specifications)

Is software redundancy needed?

• Unlike HW, SW faults are not an inevitable consequence of the real world

• Some people believe that a higher level of reliability can be achieved by reducing software complexity instead

• The existence of fault-tolerance controllers increases program complexity considerably and adds sources of errors that affect reliability