Developing Dependable Systems CIS 376 Bruce R. Maxim UM-Dearborn.

Developing Dependable Systems

CIS 376

Bruce R. Maxim

UM-Dearborn

Software Dependability

• Customers expect all software to be dependable.

• They may accept some system failures in non-critical applications

• Applications having high dependability requirements require special programming techniques

Achieving Dependability

• Fault avoidance

– software developed to minimize impact of human error

– development process is organized so that faults in the software are detected and repaired before customer delivery

• Fault tolerance

– software designed so that faults in delivered software do not cause system failure

Fault Minimization

• Current SE methods can produce fault-free software

• Fault-free software merely conforms to its specification (it may or may not always perform correctly since the specification may be flawed)

• Producing fault-free software is very expensive and may only be justified in exceptional situations

• It may be cheaper to accept some software faults

Developing Fault-Free Software

• Needs a precise (preferably formal) specification

• Requires an organizational commitment to quality

• Information hiding and encapsulation in software design are essential

• A programming language with strict type checking and run-time checking should be used

• Needs a dependable and repeatable development process

Error Prone Constructs - part 1

• Floating-point numbers

– inherently imprecise, frequent comparison errors

• Pointers

– dangling references and aliases possible

• Dynamic memory allocation

– memory overflow and garbage problems

• Parallelism

– race conditions and deadlocks are possible
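
The floating-point comparison problem listed above can be sketched in C++. This is a minimal illustration only; the helper name `nearly_equal` and its tolerance values are assumptions, not part of the course material.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: compares two doubles with a relative tolerance
// (plus an absolute floor for values near zero) instead of using ==.
bool nearly_equal(double a, double b,
                  double rel_tol = 1e-9, double abs_tol = 1e-12) {
    double diff = std::fabs(a - b);
    double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(rel_tol * scale, abs_tol);
}

// Adds 0.1 ten times. Because 0.1 has no exact binary representation,
// the total is typically very close to 1.0 without being exactly 1.0.
double sum_tenths() {
    double total = 0.0;
    for (int i = 0; i < 10; ++i) total += 0.1;
    return total;
}
```

A test such as `sum_tenths() == 1.0` is the kind of comparison error the slide warns about; `nearly_equal(sum_tenths(), 1.0)` is the safer form.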

Error Prone Constructs - part 2

• Recursion

– memory overflow when errors occur

• Interrupts

– errors are difficult to trace

• Inheritance

– code is no longer localized, unexpected results can arise when changes are made

Note: You can use these constructs as needed, but you must be careful to use them correctly.

Information Hiding

• Information should only be available to program components on a need-to-know basis

– reduces the probability of accidental corruption of information

– information is encapsulated to prevent error propagation to the rest of the program

– since information is localized, the programmer is less likely to make errors and reviewers are more likely to find errors
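
As a rough sketch of the need-to-know idea, the C++ class below keeps its storage private so other components can only reach it through checked operations. The class `IntStack` and its capacity are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical example: a fixed-capacity stack of ints. The array and the
// element count are private, so no other component can corrupt them directly.
class IntStack {
public:
    void push(int value) {
        if (size_ == kCapacity) throw std::overflow_error("stack full");
        data_[size_++] = value;
    }
    int pop() {
        if (size_ == 0) throw std::underflow_error("stack empty");
        return data_[--size_];
    }
    std::size_t size() const { return size_; }

private:
    static constexpr std::size_t kCapacity = 16;  // assumed limit
    int data_[kCapacity] = {};
    std::size_t size_ = 0;
};
```

Because every access goes through `push` and `pop`, a reviewer only has to inspect these two functions to check that the index never leaves the array.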

Reliable Software Processes

• Having a well-defined, repeatable software process will reduce the number of software faults

• A well-defined repeatable process is one that does not depend entirely on individual skills, but can be carried out by a team

• Significant verification and validation process activities must be included to minimize the number of software faults.

Process Validation Activities

• Requirements inspections

• Requirements management

• Model checking

• Design inspections

• Code inspections

• Static code analysis

• Test planning and management

• Configuration management

Fault Tolerance

• Required in critical applications (high reliability needed and high failure costs)

• System can continue operation, despite software failure

• A system which seems to be fault-free must also be fault tolerant (in case specification errors exist or the validation is incorrect)

Fault Tolerant Actions

• Fault detection

– system determines an incorrect system state has occurred

• Damage assessment

– determine system parts affected by fault

• Fault recovery

– system must restore its state to a known safe state

• Fault repair

– for a non-transitory fault, system is modified to prevent repetition

Approaches

• Defensive Programming

– programmers assume faults exist in system code

– redundant code is written to check system state for consistency after modifications are made

• Fault Tolerant Architectures

– HW and SW architectures that support redundancy are used

– a fault tolerance controller detects problems and supports recovery

• Both approaches are important

Exception Management

• An exception could be a program error or an external event such as a power failure

• Exception handling facilities in programming languages allow exceptions to be handled without constant checking to detect them

• Using normal control constructs to detect exceptions in a sequence of procedure calls adds considerable timing overhead to a program
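
The point about avoiding constant checking can be sketched in C++. The function names and the negative-value failure condition below are assumptions made for illustration.

```cpp
#include <stdexcept>

// Hypothetical sensor read that signals failure by throwing.
int read_sensor(int raw) {
    if (raw < 0) throw std::runtime_error("sensor failure");
    return raw;
}

// Intermediate layers contain no error-checking code at all.
int scale(int raw)   { return read_sensor(raw) * 2; }
int process(int raw) { return scale(raw) + 1; }

// Only the caller that can actually recover handles the exception.
// Returns `fallback` if any call in the chain fails.
int process_or_default(int raw, int fallback) {
    try {
        return process(raw);
    } catch (const std::runtime_error&) {
        return fallback;
    }
}
```

With status codes instead of exceptions, both `scale` and `process` would need an explicit test after each call, which is the overhead the slide describes.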

Fault Detection

• Languages with strict type checking allow many errors to be trapped during program compilation

• Some types of errors can only be caught at run-time (e.g. cin >> I; cin >> A[I];)
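
The `cin >> I; cin >> A[I];` example compiles cleanly, but the index is only known at run time. A minimal sketch of a run-time check follows; the helper `read_element` is hypothetical, and it accepts any `std::istream` so it can be exercised with `std::cin` or a `std::istringstream`.

```cpp
#include <array>
#include <cstddef>
#include <istream>
#include <sstream>

// Hypothetical helper: reads an index and then a value from `in`, and
// stores the value only if the index lies inside the array.
// Returns false (leaving the array unchanged) on malformed or
// out-of-range input.
template <std::size_t N>
bool read_element(std::istream& in, std::array<int, N>& a) {
    long i = 0;
    int value = 0;
    if (!(in >> i)) return false;                        // not a number
    if (i < 0 || static_cast<unsigned long>(i) >= N) return false;
    if (!(in >> value)) return false;                    // bad element value
    a[static_cast<std::size_t>(i)] = value;
    return true;
}
```

The compiler cannot reject an index of 7 for a 4-element array, so the range test has to run every time input is read.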

Fault Detection Approaches

• Preventative Fault Detection

– fault detection mechanism is activated before a state change is committed

– if an erroneous state is detected, the change is cancelled

• Retrospective Fault Detection

– fault detection mechanism is initiated after system state change has been made

– used when a correct sequence of actions can lead to an erroneous system state or preventative fault detection has too much overhead
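
The two approaches can be contrasted on a single state variable. This is a sketch under assumptions: the `Tank` class and its 0 to 100 litre limits are invented for the example.

```cpp
// Hypothetical tank controller; the 0..100 litre limits are assumed.
class Tank {
public:
    // Preventative: validate the proposed state before committing it.
    bool add_preventative(int litres) {
        int proposed = level_ + litres;
        if (!valid(proposed)) return false;   // change is cancelled
        level_ = proposed;
        return true;
    }

    // Retrospective: commit the change first, then check the new state
    // and restore the previous value if it is erroneous.
    bool add_retrospective(int litres) {
        int previous = level_;
        level_ += litres;
        if (!valid(level_)) {
            level_ = previous;
            return false;
        }
        return true;
    }

    int level() const { return level_; }

private:
    static bool valid(int level) { return level >= 0 && level <= 100; }
    int level_ = 0;
};
```

In the preventative version the erroneous state never exists; in the retrospective version it exists briefly, so a recovery step is needed.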

Type System Extension

• Preventative fault detection really involves extending the current type system by including additional constraints as part of the type definition

• These constraints are typically implemented by defining basic operations within a class definition
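
One way to read "extending the type system" is a class whose operations enforce the extra constraint. The sketch below assumes a hypothetical type `PositiveEvenInt`; the constraint itself is chosen only for illustration.

```cpp
#include <stdexcept>

// Hypothetical constrained type: an int that must always be positive
// and even. Every operation that sets the value checks the constraint.
class PositiveEvenInt {
public:
    explicit PositiveEvenInt(int v) : value_(checked(v)) {}

    // Throws before the stored value changes, so the object never
    // holds an invalid state.
    void assign(int v) { value_ = checked(v); }

    int value() const { return value_; }

private:
    static int checked(int v) {
        if (v <= 0 || v % 2 != 0)
            throw std::invalid_argument("value must be positive and even");
        return v;
    }
    int value_;
};
```

Code that receives a `PositiveEvenInt` can rely on the constraint without rechecking it.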

Damage Assessment

• System is analyzed to judge the extent of corruption caused by a system failure

• Must determine what parts of the state space have been affected by the failure

• Generally based on “validity functions” which can be applied to the state elements to assess if their value is within an allowed range
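
A validity function can be as simple as a range check applied to each state element. In the following sketch the `Reading` structure, its field names, and the allowed ranges are all assumptions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical state element; field names and ranges are assumed.
struct Reading {
    double temperature_c;
    double pressure_kpa;
};

// Validity functions: each checks one value against its allowed range.
bool temperature_valid(const Reading& r) {
    return r.temperature_c >= -40.0 && r.temperature_c <= 125.0;
}
bool pressure_valid(const Reading& r) {
    return r.pressure_kpa >= 0.0 && r.pressure_kpa <= 500.0;
}

// Damage assessment: returns the indices of the state elements that
// fail at least one validity function.
std::vector<std::size_t> damaged_elements(const std::vector<Reading>& state) {
    std::vector<std::size_t> damaged;
    for (std::size_t i = 0; i < state.size(); ++i) {
        if (!temperature_valid(state[i]) || !pressure_valid(state[i]))
            damaged.push_back(i);
    }
    return damaged;
}
```

The returned indices tell the recovery step which parts of the state space need repair or replacement.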

Damage Assessment Techniques

• Checksums are used to check for data transmission errors

• Redundant pointers can be used to check integrity of data structures

• Watchdog timers can help check for non-terminating processes (e.g. if there is no response for a long time, assume the worst)
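
The checksum technique can be sketched with a Fletcher-16 checksum. The choice of Fletcher-16 and the helper name `is_intact` are assumptions for this example; the slides do not name a specific algorithm.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fletcher-16 checksum over a byte sequence (two running sums modulo 255).
std::uint16_t fletcher16(const std::vector<std::uint8_t>& data) {
    std::uint16_t sum1 = 0;
    std::uint16_t sum2 = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        sum1 = static_cast<std::uint16_t>((sum1 + data[i]) % 255);
        sum2 = static_cast<std::uint16_t>((sum2 + sum1) % 255);
    }
    return static_cast<std::uint16_t>((sum2 << 8) | sum1);
}

// Hypothetical helper: true if the data still matches the checksum
// that was computed when the data was sent or stored.
bool is_intact(const std::vector<std::uint8_t>& data, std::uint16_t expected) {
    return fletcher16(data) == expected;
}
```

The sender transmits the checksum with the data; the receiver recomputes it and treats a mismatch as evidence of corruption.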

Fault Recovery

• Forward Recovery

– apply repairs to corrupted system state

– usually application specific, requires domain knowledge

– e.g. error coding such as a checksum added to data

• Backward Recovery

– restore system to known safe state

– simpler, since archived safe state is used to replace erroneous state

– e.g. use of checkpoints in a word processor (WP) editor
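
Backward recovery with checkpoints can be sketched as follows. The `Document` class is hypothetical and keeps only a single checkpoint; forward recovery is not shown.

```cpp
#include <string>

// Hypothetical editor buffer with a single checkpoint.
class Document {
public:
    // Archive the current text as the known safe state.
    void checkpoint() { saved_ = text_; }

    void append(const std::string& s) { text_ += s; }

    // Backward recovery: discard the current (possibly erroneous)
    // state and restore the archived one.
    void rollback() { text_ = saved_; }

    const std::string& text() const { return text_; }

private:
    std::string text_;
    std::string saved_;
};
```

No knowledge of what went wrong is needed to roll back, which is why the slide calls backward recovery the simpler approach.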

Fault Tolerant Architecture

• Defensive programming cannot cope with faults caused by HW and SW interactions

• If requirements are not understood then SW checks are not likely to be correct

• Systems with high availability requirements often require fault tolerant architectures

• Must tolerate both HW and SW failure

Hardware Fault Tolerance

• Triple-modular redundancy (TMR)

• Three replicated components are included in the system

• If one component produces different output than the other two, its failure is assumed

• This idea is based on the notion that most failures result from component failures, not design faults

• Component failures should be a low probability event
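
The TMR comparison step can be sketched as a two-out-of-three majority voter. Real TMR voters are hardware; the `vote` function and `VoteResult` structure below are assumptions that only model the logic.

```cpp
// Hypothetical result of a 2-of-3 vote.
struct VoteResult {
    bool agreed;      // true if at least two units produced the same output
    int value;        // the majority output (meaningful only if agreed)
    int faulty_unit;  // 0, 1 or 2 for the disagreeing unit, -1 if none
};

// Compares the outputs of three replicated units and assumes that a
// unit disagreeing with the other two has failed.
VoteResult vote(int a, int b, int c) {
    if (a == b && b == c) return {true, a, -1};
    if (a == b) return {true, a, 2};
    if (a == c) return {true, a, 1};
    if (b == c) return {true, b, 0};
    return {false, 0, -1};   // no majority: more than one unit has failed
}
```

The voter masks a single component failure; if two units fail at once there is no majority, which is why component failures must be low-probability events.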

Software Fault Tolerance

• TMR is based on two assumptions

– HW components do not include common design flaws

– simultaneous component failures are not likely

• Neither assumption is valid for software components

– it is not possible to replicate SW components without replicating their design flaws

– simultaneous failure of replicated components is therefore virtually inevitable

• Software systems must be diverse

Design Diversity

• Different versions of the system are designed and implemented in different ways (so they should have different failure modes)

• Different approaches to design

– object-oriented and function-oriented

– different implementation languages

– different algorithms in the implementation

– different tools or environments

Software Analogies to TMR

• N-version Programming

– same specification is implemented in a number of different versions by several teams

– all versions compute simultaneously, the majority output is presumed correct

• Recovery blocks

– a number of explicitly distinct versions of a program are written for the same specification and executed in sequence

– an acceptance test is used to select the output to keep
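
The recovery block scheme can be sketched as a loop over alternates followed by an acceptance test. The function `recovery_block` is a hypothetical simplification: it works on a single `double` and omits the state restoration a real recovery block performs before trying the next alternate.

```cpp
#include <cmath>
#include <functional>
#include <vector>

// Runs the alternates in order and keeps the first result that passes
// the acceptance test. Returns false if every alternate is rejected.
bool recovery_block(
        const std::vector<std::function<double(double)> >& alternates,
        const std::function<bool(double, double)>& acceptable,
        double input, double& output) {
    for (std::size_t i = 0; i < alternates.size(); ++i) {
        double candidate = alternates[i](input);
        if (acceptable(input, candidate)) {
            output = candidate;
            return true;
        }
    }
    return false;
}
```

For a square-root routine, for example, the acceptance test could check that the candidate squared is close to the input, so a faulty primary version is rejected and the next alternate is tried.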

Problems with Design Diversity

• Teams tend to tackle the same problems in the same ways, so the resulting implementations may not be diverse

• Characteristic errors

– different teams are likely to make the same mistakes, since some parts of the implementation are more difficult than others

– specification errors may cause the same errors to appear in all implementations (argument for developing multiple specifications)

Is software redundancy needed?

• Unlike HW, SW faults are not an inevitable consequence of the real world

• Some people believe that a higher level of reliability can be achieved by reducing software complexity instead

• The existence of fault-tolerance controllers increases program complexity considerably and adds sources of errors that affect reliability
