Critical Systems Specification
IS301 – Software Engineering
Lecture #18 – 2003-10-23
M. E. Kabay, PhD, CISSP
Dept of Computer Information Systems, Norwich University
[email protected]
Copyright © 2003 M. E. Kabay. All rights reserved.
Acknowledgement
All of the material in this presentation is based directly on slides kindly provided by Prof. Ian Sommerville on his Web site at http://www.software-engin.com
Used with Sommerville’s permission, as extended by him, for all non-commercial educational use
Copyright in Kabay’s name applies solely to the appearance of and minor changes to Sommerville’s work, or to original materials, and is used solely to prevent commercial exploitation of this material
Topics
Software reliability specification
Safety specification
Security specification
Dependable Systems Specification
Processes and techniques for developing specifications for:
System availability
Reliability
Safety
Security
Functional and Non-Functional Requirements
System functional requirements:
Define error checking
Recovery facilities and features
Protection against system failures
Non-functional requirements:
Required reliability
Availability of system
System Reliability Specification
Hardware reliability
P{hardware component failing}?
Time to repair component?
Software reliability
P{incorrect output}?
Software can often continue operation after an error; hardware faults often cause a stoppage
Operator reliability
P{operator error}?
What Happens When All Components Must Work?
Consider a system with two independent components A and B, where
P{failure of A} = PA
P{failure of B} = PB
Recall that P{not A} = 1 – P{A} and, for independent events, P{A and B} = P{A} × P{B}
Operation of the system depends on both components, so the system fails if at least one of them fails:
P{A will not fail} = (1 – PA)
P{B will not fail} = (1 – PB)
P{A and B will both not fail} = (1 – PA)(1 – PB)
P{system failure} = 1 – [(1 – PA)(1 – PB)]
General Principles
If there are n elements, each element i having probability of failure Pi, and all of them have to work for the system to work, then the probability of system failure PS is

PS = 1 – (1 – P1)(1 – P2)…(1 – Pn)

Therefore, as the number of components (all of which need to function) increases, the probability of system failure increases
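The formula above is easy to check numerically. A minimal Python sketch (the function name is my own, not from the slides):

```python
from math import prod

def series_failure_probability(failure_probs):
    """Probability that a system fails when every component must work:
    P_S = 1 - (1 - P_1)(1 - P_2)...(1 - P_n)."""
    return 1 - prod(1 - p for p in failure_probs)

# Three components, each failing with probability 0.01:
print(series_failure_probability([0.01, 0.01, 0.01]))  # ~0.0297

# Thirty such components: system failure probability rises to ~0.26
print(series_failure_probability([0.01] * 30))
```

Note how quickly the system failure probability grows with the number of required components, which is the point of the slide.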
Component Replication
If components with failure probability P are replicated so that the system works as long as any one of the n components works, then the probability of system failure is

PS = P{all n will fail} = P^n

If instead the system fails when any one of the components fails, then the probability of system failure is

PS = P{at least 1 will fail} = P{not all will work} = 1 – (1 – P)^n
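The contrast between the two cases can be sketched in Python (function names are my own):

```python
def replicated_failure(p, n):
    """System works if ANY of n replicas works, so it fails
    only when all n fail: P_S = p ** n."""
    return p ** n

def chain_failure(p, n):
    """System fails if ANY of n required components fails:
    P_S = 1 - (1 - p) ** n."""
    return 1 - (1 - p) ** n

# With per-component failure probability 0.01 and n = 3:
print(replicated_failure(0.01, 3))  # about 1e-06: redundancy helps
print(chain_failure(0.01, 3))       # about 0.0297: more required parts hurt
```

The same component count either strengthens or weakens the system, depending on whether the components are redundant replicas or a chain of required parts.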
Examples of Functional Reliability Requirements
A predefined range for all values input by the operator shall be defined, and the system shall check that all operator inputs fall within this predefined range
The system shall check all disks for bad blocks when it is initialized
The system must use N-version programming to implement the braking control system
The system must be implemented in a safe subset of Ada and checked using static analysis
Non-Functional Reliability Specification
The required level of system reliability should be expressed quantitatively
Reliability is a dynamic system attribute, so reliability specifications related to the source code are meaningless:
“No more than N faults/1000 lines” -- BAD
Useful only for post-delivery process analysis -- trying to assess quality of development techniques
An appropriate reliability metric should be chosen to specify overall system reliability
Reliability Metrics
Reliability metrics: units of measurement of system reliability
Count number of operational failures
Relate failures to demands on the system and to the time the system has been operational
A long-term measurement program is required to assess the reliability of critical systems
Reliability Metrics
POFOD (Probability of failure on demand): The likelihood that the system will fail when a service request is made. For example, a POFOD of 0.001 means that 1 out of a thousand service requests may result in failure.
ROCOF (Rate of failure occurrence): The frequency with which unexpected behaviour is likely to occur. For example, a ROCOF of 2/100 means that 2 failures are likely to occur in each 100 operational time units. This metric is sometimes called the failure intensity.
MTTF (Mean time to failure): The average time between observed system failures. For example, an MTTF of 500 means that 1 failure can be expected every 500 time units.
MTTR (Mean time to repair): The average time between a system failure and the return of that system to service.
AVAIL (Availability): The probability that the system is available for use at a given time. For example, an availability of 0.998 means that in every 1000 time units, the system is likely to be available for 998 of these.
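These metrics are related to one another. For example, steady-state availability follows from MTTF and MTTR via the standard textbook relation AVAIL = MTTF / (MTTF + MTTR), which is not stated on the slide but is consistent with its example:

```python
def availability(mttf, mttr):
    """Steady-state availability: fraction of time the system is usable.
    Standard relation: AVAIL = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

# A system that runs 499 time units between failures and takes
# 1 time unit to repair matches the slide's AVAIL example of 0.998:
print(availability(499, 1))
```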
Probability of Failure on Demand (POFOD)
Probability that the system will fail when a service request is made. Useful when demands for service are intermittent and relatively infrequent
Appropriate for protection systems where services are demanded occasionally and where there are serious consequences if the service is not delivered
Relevant for many safety-critical systems with exception management components, e.g., the emergency shutdown system in a chemical plant
Rate of Failure Occurrence (ROCOF)
Reflects the rate of occurrence of failures in the system
A ROCOF of 0.002 means 2 failures are likely in each 1,000 operational time units, e.g., 2 failures per 1,000 hours of operation
Relevant for operating systems and transaction processing systems where the system has to process a large number of similar requests relatively frequently, e.g., a credit card processing system or an airline booking system
Mean Time to Failure
A measure of the time between observed failures of the system; the reciprocal of ROCOF for stable systems
An MTTF of 500 means the mean time between failures is 500 time units
Relevant for systems with long transactions, i.e., where system processing takes a long time; MTTF should be longer than the transaction length, e.g., computer-aided design systems where a designer will work on a design for several hours, or word processor systems
Availability
A measure of the fraction of time the system is available for use; takes repair and restart time into account
An availability of 0.998 means the software is available for 998 out of every 1,000 time units
Relevant for non-stop, continuously running systems, e.g., telephone switching systems and railway signaling systems
Failure Consequences
Reliability measurements do NOT take the consequences of failure into account
Transient faults may have no real consequences, but other faults may cause data loss, data corruption, or loss of system service
Identify different failure classes and use different metrics for each of them; the reliability specification must be structured
Failure Consequences
When specifying reliability, it is not just the number of system failures that matters but also the consequences of these failures
Failures with serious consequences are clearly more damaging than those where repair and recovery are straightforward
In some cases, therefore, different reliability specifications may be defined for different types of failure
Failure Classification
Transient: Occurs only with certain inputs
Permanent: Occurs with all inputs
Recoverable: System can recover without operator intervention
Unrecoverable: Operator intervention needed to recover from failure
Non-corrupting: Failure does not corrupt system state or data
Corrupting: Failure corrupts system state or data
Steps to Reliability Specification
For each sub-system, analyze the consequences of possible system failures
From the system failure analysis, partition failures into appropriate classes
For each failure class identified, set out the required reliability using an appropriate metric; different metrics may be used for different reliability requirements
Identify functional reliability requirements to reduce the chances of critical failures
Bank Auto-Teller System
Expected usage statistics:
Each machine in the network is used 300 times per day
Lifetime of a software release: 2 years
Each machine handles about 220,000 transactions over 2 years
Total throughput:
The bank has 1,000 ATMs
~300,000 database transactions per day
~110M transactions per year
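The throughput figures above follow from simple arithmetic, which a short Python sketch can reproduce:

```python
# Reproduce the slide's ATM throughput estimates.
uses_per_machine_per_day = 300
machines = 1_000
release_lifetime_days = 2 * 365   # 2-year software release lifetime

per_machine_lifetime = uses_per_machine_per_day * release_lifetime_days
daily_total = uses_per_machine_per_day * machines
yearly_total = daily_total * 365

print(per_machine_lifetime)  # 219000, i.e. about 220,000 per machine
print(daily_total)           # 300000 database transactions per day
print(yearly_total)          # 109500000, i.e. about 110M per year
```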
Bank ATM (cont’d)
Types of failure:
Single-machine failures: affect an individual ATM
Network failures: affect groups of ATMs and lower throughput
Central database failures: potentially affect the entire network
Examples of Reliability Spec.
Failure class: Permanent, non-corrupting
Example: System fails to operate with any card that is input; software must be restarted to correct the failure
Reliability metric: ROCOF of 1 occurrence per 1,000 days

Failure class: Transient, non-corrupting
Example: Magnetic-stripe data cannot be read on an undamaged card that is input
Reliability metric: POFOD of 1 in 1,000 transactions

Failure class: Transient, corrupting
Example: A pattern of transactions across the network causes database corruption
Reliability metric: Unquantifiable! Should never happen in the lifetime of the system
Specification Validation
It is impossible to validate very high reliability specifications empirically
E.g., in the ATM example, “no database corruptions” = a POFOD of less than 1 in 220 million
If a transaction takes 1 second, then simulating one day’s ATM transactions on a single system would take 300,000 seconds ≈ 3.5 days
Testing a single run of 110M transactions would take ~3.5 years; it would take longer than the system’s lifetime (2 years) to test it for reliability
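The test-time arithmetic on this slide can be checked directly, assuming (as the slide does) 1 second per transaction:

```python
SECONDS_PER_DAY = 86_400

day_of_transactions = 300_000        # seconds to replay one day's load
year_of_transactions = 110_000_000   # seconds to replay one year's load

print(day_of_transactions / SECONDS_PER_DAY)           # ~3.47 days
print(year_of_transactions / (SECONDS_PER_DAY * 365))  # ~3.49 years
```

Both results round to the slide's 3.5-day and 3.5-year figures, confirming that empirical validation is impractical within the 2-year software lifetime.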
Topics
Software reliability specification
Safety specification
Security specification
Safety Specification
Safety requirements of a system should be:
Separately specified
Based on an analysis of possible hazards and risks
Safety requirements usually apply to the system as a whole rather than to individual sub-systems
In systems engineering terms, the safety of a system is an emergent property
Safety Life-Cycle
[Figure: the safety life-cycle. Concept and scope definition leads to hazard and risk analysis, then safety requirements derivation and allocation; planning and development covers planning, safety-related systems development, and external risk reduction facilities; followed by installation and commissioning, safety validation, operation and maintenance, and system decommissioning.]
Safety Processes
Hazard and risk analysis: assess the hazards and risks of damage associated with the system
Safety requirements specification: specify a set of safety requirements which apply to the system
Designation of safety-critical systems: identify sub-systems whose incorrect operation may compromise system safety; ideally, these should be as small a part as possible of the whole system
Safety validation: check overall system safety
Hazard and Risk Analysis
[Figure: the hazard and risk analysis process. Hazard identification produces a hazard description; risk analysis and hazard classification produce a risk assessment; hazard decomposition (e.g., fault tree analysis) and risk reduction assessment yield the preliminary safety requirements.]
Hazard Analysis Stages
Hazard identification: identify potential hazards which may arise
Risk analysis and hazard classification: assess the risk associated with each hazard
Hazard decomposition: decompose hazards to discover their potential root causes
Risk reduction assessment: define how each hazard must be taken into account when the system is designed
Fault-Tree Analysis
A method of hazard analysis that starts with an identified fault and works backward to the causes of the fault
Used at all stages of hazard analysis, from preliminary analysis through detailed software checking
A top-down hazard analysis method; may be combined with bottom-up methods, which start with system failures and lead to hazards
Fault-Tree Analysis
Identify the hazard
Identify potential causes of the hazard; there are usually several alternative causes
Link these on the fault tree with ‘or’ or ‘and’ symbols
Continue the process until root causes are identified
Consider the following example: how data might be lost in a system where a backup process is running
Fault Tree Example
[Figure: fault tree for the top event “Data deleted”, with ‘or’ gates throughout. First-level causes: H/W failure, S/W failure, External attack, Operator failure. S/W failure decomposes into Operating system failure and Backup system failure; lower-level causes include Incorrect configuration, Incorrect operator input, Execution failure, Timing fault, Algorithm fault, Data fault, UI design fault, Training fault, and Human error.]
Risk Assessment
Assesses hazard severity, hazard probability, and accident probability
The outcome of risk assessment is a statement of acceptability:
Intolerable: must never arise or result in an accident
As low as reasonably practical (ALARP): must minimize the possibility of the hazard given cost and schedule constraints
Acceptable: the consequences of the hazard are acceptable, and no extra costs should be incurred to reduce the hazard probability
Levels of Risk
Unacceptable regionrisk cannot be tolerated
Risk tolerated only ifrisk reduction is impractical
or grossly expensive
Acceptableregion
Negligible risk
ALARPregion
As low as reasonably
practical
RIS
KS
COSTS
37 Copyright © 2003 M. E. Kabay. All rights reserved.
Risk Acceptability
The acceptability of risk is determined by human, social, and political considerations
In most societies, the boundaries between the regions are pushed upwards with time; i.e., society is increasingly less willing to accept risk
For example, the costs of cleaning up pollution may be less than the costs of preventing it, but pollution may not be socially acceptable
Risk assessment is often highly subjective: we often lack hard data on real probabilities, and whether risks are identified as probable, unlikely, etc. depends on who is making the assessment
Why Do We Lack Firm Risk Probabilities and Costs?
Failure of observation: don’t notice
Failure of reporting: don’t tell anyone
Variability of systems: can’t pool data
Difficulty of classifying incidents: can’t compare problems
Difficulty of measuring costs: don’t know all the repercussions
Risk Reduction
The system should be specified so that hazards do not arise or result in an accident
Hazard avoidance: design the system so the hazard can never arise during correct system operation
Hazard detection and removal: design the system so hazards are detected and neutralized before they result in an accident
Damage limitation or mitigation: design the system so the consequences of an accident are minimized or at least reduced
Specifying Forbidden Behavior: Examples
The system shall not allow users to modify access permissions on any files they have not created (security)
The system shall not allow reverse thrust mode to be selected when the aircraft is in flight (safety)
The system shall not allow simultaneous activation of more than three alarm signals (safety)
Topics
Software reliability specification
Safety specification
Security specification
Security Specification
Similar to safety specification:
Not possible to specify security requirements quantitatively
Requirements are often ‘shall not’ rather than ‘shall’ requirements
Differences:
No well-defined notion of a security life cycle for security management
Generic threats rather than system-specific hazards
Mature security technology (encryption, etc.), but problems in transferring it into general use because of corporate culture
Security Specification Process
[Figure: the security specification process. Asset identification produces a system asset list; threat analysis and risk assessment produce an asset and threat description; threat assignment produces a threat and risk matrix; technology analysis produces a security technology analysis; security requirements specification yields the security requirements.]
Stages in Security Specification (1)
Asset identification and evaluation: assets (data and programs) are identified, along with their required degree of protection (criticality and sensitivity)
Threat analysis and risk assessment: possible threats are identified and risks are estimated
Threat assignment: identified threats are related to assets, producing, for each identified asset, a list of associated threats
Stages in Security Specification (2)
Technology analysis: identify available security technologies and assess their applicability against the identified threats
Security requirements specification: covers policy, procedure, and technology
HOMEWORK
Apply the full Read-Recite-Review phases of SQ3R to Chapter 17 of Sommerville’s text
For next class (Tuesday), apply the Survey-Question phases to Chapter 18 on Critical Systems Development
For Thursday 30 Oct 2003, REQUIRED: hand in responses to Exercises 17.1 (2 points), 17.2 (6), 17.3 (4), 17.4 (4), 17.5 (2), 17.6 (6), and 17.7 (6) = 30 points total
OPTIONAL by 6 Nov: 17.8 and/or 17.9 for 3 extra points each
DISCUSSION