Software Reliability CIS 376 Bruce R. Maxim UM-Dearborn
Transcript
Page 1: Software Reliability CIS 376 Bruce R. Maxim UM-Dearborn.

Software Reliability

CIS 376

Bruce R. Maxim

UM-Dearborn

Page 2:

Functional and Non-functional Requirements

• System functional requirements may specify error checking, recovery features, and protection against system failure.

• System reliability and availability are specified as part of the non-functional requirements for the system.

Page 3:

System Reliability Specification

• Hardware reliability – probability a hardware component fails

• Software reliability – probability a software component will produce an incorrect output

– software does not wear out

– software can continue to operate after producing a bad result

• Operator reliability – probability the system user makes an error

Page 4:

Failure Probabilities

• If there are two independent components in a system and the operation of the system depends on them both, then the probability of system failure is

P(S) = P(A) + P(B)

(a good approximation when P(A) and P(B) are small; the exact value is P(A) + P(B) − P(A)P(B))

• If a component is replicated n times, then the probability of system failure is

P(S) = P(A)^n

meaning that the system fails only when all n replicas fail at once
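
The two formulas above can be sketched in a few lines of Python (a minimal illustration, not from the slides; the example probabilities are assumed):

```python
def series_failure(p_a, p_b):
    """System depends on both components, so it fails if either fails.
    P(A) + P(B) is the small-probability approximation;
    the exact value is P(A) + P(B) - P(A) * P(B)."""
    return p_a + p_b - p_a * p_b

def replicated_failure(p, n):
    """n independent identical replicas; the system fails only if all n fail."""
    return p ** n

print(series_failure(0.01, 0.01))   # close to 0.02 for small probabilities
print(replicated_failure(0.01, 3))  # replication drives failure probability down sharply
```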

Page 5:

Functional Reliability Requirements

• The system will check that all operator inputs fall within their required ranges.

• The system will check all disks for bad blocks each time it is booted.

• The system must be implemented using a standard implementation of Ada.

Page 6:

Non-functional Reliability Specification

• The required level of reliability must be expressed quantitatively.

• Reliability is a dynamic system attribute.

• Source code reliability specifications are meaningless (e.g. N faults/1000 LOC)

• An appropriate metric should be chosen to specify the overall system reliability.

Page 7:

Hardware Reliability Metrics

• Hardware metrics are not suitable for software since they are based on the notion of physical component failure

• Software failures are often design failures

• Often the system is available after the failure has occurred

• Hardware components can wear out

Page 8:

Software Reliability Metrics

• Reliability metrics are units of measure for system reliability

• System reliability is measured by counting the number of operational failures and relating these to demands made on the system at the time of failure

• A long-term measurement program is required to assess the reliability of critical systems

Page 9:

Reliability Metrics - part 1

• Probability of Failure on Demand (POFOD)

– POFOD = 0.001

– one in every 1000 service requests fails

• Rate of Fault Occurrence (ROCOF)

– ROCOF = 0.02

– two failures in every 100 operational time units
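
Both metrics are simple ratios estimated from operational data. A minimal sketch (the counts are the slide's example figures):

```python
def pofod(failures, demands):
    """Probability of Failure on Demand: fraction of service requests that fail."""
    return failures / demands

def rocof(failures, time_units):
    """Rate of Fault Occurrence: failures per operational time unit."""
    return failures / time_units

print(pofod(1, 1000))  # 0.001 - one failed request per 1000, as in the slide
print(rocof(2, 100))   # 0.02 - two failures per 100 time units, as in the slide
```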

Page 10:

Reliability Metrics - part 2

• Mean Time to Failure (MTTF) – average time between observed failures (aka MTBF)

• Availability = MTBF / (MTBF + MTTR)

– MTBF = Mean Time Between Failures

– MTTR = Mean Time to Repair

• Reliability = MTBF / (1 + MTBF)
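
The two formulas above translate directly into code. A minimal sketch (the MTBF and MTTR figures are assumed, not from the slides):

```python
def availability(mtbf, mttr):
    """Fraction of time the system is available: uptime over total time."""
    return mtbf / (mtbf + mttr)

def reliability(mtbf):
    """The slide's Reliability = MTBF / (1 + MTBF) formula."""
    return mtbf / (1 + mtbf)

# Assumed example: the system runs 990 hours between failures
# and takes 10 hours to repair.
print(availability(990, 10))  # 0.99 - available 99% of the time
print(reliability(99))        # 0.99 by the slide's formula
```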

Page 11:

Time Units

• Raw Execution Time – for non-stop systems

• Calendar Time – if the system has regular usage patterns

• Number of Transactions – for demand-type transaction systems

Page 12:

Availability

• Measures the fraction of time system is really available for use

• Takes repair and restart times into account

• Relevant for non-stop continuously running systems (e.g. traffic signal)

Page 13:

Probability of Failure on Demand

• Probability system will fail when a service request is made

• Useful when requests are made on an intermittent or infrequent basis

• Appropriate for protection systems, where service requests may be rare and consequences can be serious if the service is not delivered

• Relevant for many safety-critical systems with exception handlers

Page 14:

Rate of Fault Occurrence

• Reflects rate of failure in the system

• Useful when system has to process a large number of similar requests that are relatively frequent

• Relevant for operating systems and transaction processing systems

Page 15:

Mean Time to Failure

• Measures time between observable system failures

• For stable systems MTTF = 1/ROCOF

• Relevant for systems when individual transactions take lots of processing time (e.g. CAD or WP systems)
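The MTTF = 1/ROCOF relation for stable systems can be checked with the earlier slide's example figure:

```python
rocof_value = 0.02      # two failures per 100 time units (from the earlier slide)
mttf = 1 / rocof_value  # holds for a stable system
print(mttf)             # 50.0 time units between failures, on average
```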

Page 16:

Failure Consequences - part 1

• Reliability does not take consequences into account

• Transient faults have no real consequences but other faults might cause data loss or corruption

• May be worthwhile to identify different classes of failure, and use different metrics for each

Page 17:

Failure Consequences - part 2

• When specifying reliability both the number of failures and the consequences of each matter

• Failures with serious consequences are more damaging than those where repair and recovery is straightforward

• In some cases, different reliability specifications may be defined for different failure types

Page 18:

Failure Classification

• Transient - only occurs with certain inputs

• Permanent - occurs on all inputs

• Recoverable - system can recover without operator help

• Unrecoverable - operator has to help

• Non-corrupting - failure does not corrupt system state or data

• Corrupting - system state or data are altered

Page 19:

Building Reliability Specification

• For each sub-system, analyze the consequences of possible system failures

• From the system failure analysis, partition failures into appropriate classes

• For each failure class, select the appropriate reliability metric

Page 20:

Examples

Failure Class               Example                                 Metric
Permanent, non-corrupting   ATM fails to operate with any card;     ROCOF = 0.0001
                            must restart to correct                 (time unit = days)
Transient, non-corrupting   Magnetic stripe can't be read on an     POFOD = 0.0001
                            undamaged card                          (time unit = transactions)

Page 21:

Specification Validation

• It is impossible to empirically validate high reliability specifications

• A requirement of "no database corruption" really means POFOD < 1 in 200 million

• If each transaction takes 1 second to verify, simulation of one day’s transactions takes 3.5 days
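
The slide's figure can be checked with a little arithmetic; it implies roughly 300,000 transactions per day (an assumed figure, implied by but not stated in the slide):

```python
# Assumed load implied by the slide's 3.5-day figure.
transactions_per_day = 300_000
seconds_per_transaction = 1
seconds_per_day = 86_400

# Replaying one day's transactions at 1 second each:
simulation_days = transactions_per_day * seconds_per_transaction / seconds_per_day
print(round(simulation_days, 1))  # about 3.5 days of wall-clock time
```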

Page 22:

Statistical Reliability Testing

• Test data needs to follow typical software usage patterns (an operational profile)

• Measuring the number of errors needs to cover both errors of omission (failing to do the right thing) and errors of commission (doing the wrong thing)

Page 23:

Difficulties with Statistical Reliability Testing

• Uncertainty when creating the operational profile

• High cost of generating the operational profile

• Statistical uncertainty problems when high reliabilities are specified

Page 24:

Safety Specification

• Safety requirements should be specified separately

• These requirements should be based on hazard and risk analysis

• Safety requirements usually apply to the system as a whole rather than individual components

• System safety is an emergent system property

Page 25:

Safety Life Cycle - part 1

• Concept and scope definition

• Hazard and risk analysis

• Safety requirements specification

– safety requirements derivation

– safety requirements allocation

• Planning and development

– safety-related systems development

– external risk reduction facilities

Page 26:

Safety Life Cycle - part 2

• Deployment

– safety validation

– installation and commissioning

• Operation and maintenance

• System decommissioning

Page 27:

Safety Processes

• Hazard and risk analysis

– assess the hazards and risks associated with the system

• Safety requirements specification

– specify system safety requirements

• Designation of safety-critical systems

– identify sub-systems whose incorrect operation can compromise entire system safety

• Safety validation

– check overall system safety

Page 28:

Hazard Analysis Stages

• Hazard identification

– identify potential hazards that may arise

• Risk analysis and hazard classification

– assess the risk associated with each hazard

• Hazard decomposition

– seek to discover potential root causes of each hazard

• Risk reduction assessment

– describe how each hazard is to be taken into account when the system is designed

Page 29:

Fault-tree Analysis

• Hazard analysis method that starts with an identified fault and works backwards to the cause of the fault

• Can be used at all stages of hazard analysis

• It is a top-down technique that may be combined with bottom-up hazard analysis techniques, which start from the system failures that lead to hazards

Page 30:

Fault-tree Analysis Steps

• Identify the hazard

• Identify potential causes of the hazard

• Link combinations of alternative causes using "or" and "and" symbols as appropriate

• Continue the process until "root" causes are identified (the result is an and/or tree, like a logic circuit); the root causes are the "leaves"

Page 31:

How does it work?

• What would a fault tree describing the causes of a hazard like "data deleted" look like?
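
One way to think about it is to represent the tree as nested and/or nodes over leaf events and evaluate it bottom-up. A minimal sketch (the causes listed here are hypothetical, chosen only to illustrate the structure):

```python
def evaluate(node, events):
    """Evaluate an and/or fault tree bottom-up.
    A node is either a leaf event name (string) or a pair
    ("and"/"or", [child nodes]); `events` is the set of leaf
    events assumed to have occurred."""
    if isinstance(node, str):
        return node in events
    op, children = node
    results = [evaluate(child, events) for child in children]
    return all(results) if op == "and" else any(results)

# Hypothetical tree: data is deleted if an operator deletes the wrong
# file, OR a software bug issues a delete AND no backup exists.
data_deleted = ("or", [
    "operator deletes wrong file",
    ("and", ["software bug issues delete", "no backup exists"]),
])

print(evaluate(data_deleted, {"software bug issues delete", "no backup exists"}))  # True
print(evaluate(data_deleted, {"software bug issues delete"}))                      # False
```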

Page 32:

Risk Assessment

• Assess the hazard severity, hazard probability, and accident probability

• The outcome of risk assessment is a statement of acceptability:

– Intolerable (the risk must never arise)

– ALARP (as low as reasonably practicable, given cost and schedule constraints)

– Acceptable (consequences are acceptable and no extra cost should be incurred to reduce the risk further)

Page 33:

Risk Acceptability

• Determined by human, social, and political considerations

• In most societies, the boundaries between regions are pushed upwards with time (meaning risk becomes less acceptable)

• Risk assessment is always subjective (what is acceptable to one person is ALARP to another)

Page 34:

Risk Reduction

• The system should be specified so that hazards do not arise or result in an accident

• Hazard avoidance

– system designed so the hazard can never arise during normal operation

• Hazard detection and removal

– system designed so that hazards are detected and neutralized before an accident can occur

• Damage limitation

– system designed to minimize accident consequences

Page 35:

Security Specification

• Similar to safety specification

– not possible to specify quantitatively

– usually stated in "system shall not" terms rather than "system shall" terms

• Differences

– no well-defined security life cycle yet

– security deals with generic threats rather than system-specific hazards

Page 36:

Security Specification Stages - part 1

• Asset identification and evaluation

– data and programs are identified along with their required level of protection

– degree of protection depends on asset value

• Threat analysis and risk assessment

– security threats are identified and the risk associated with each is estimated

• Threat assignment

– identified threats are related to assets so that each asset has a list of associated threats

Page 37:

Security Specification Stages - part 2

• Technology analysis

– assess available security technologies and their applicability against the identified threats

• Security requirements specification

– where appropriate, these identify the security technologies that may be used to protect against different threats to the system