Page 1
1 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved.
Critical Systems Development
IS301 – Software Engineering
Lecture #27 – 2004-11-03
M. E. Kabay, PhD, CISSP
Assoc. Prof. Information Assurance
Division of Business & Management, Norwich University
mailto:[email protected] V: 802.479.7937
Page 2
First, take a deep breath.
You are about to enter the fire-hose zone.
Page 3
Objectives
To explain how fault tolerance and fault avoidance contribute to the development of dependable systems
To describe characteristics of dependable software processes
To introduce programming techniques for fault avoidance
To describe fault tolerance mechanisms and their use of diversity and redundancy
Page 4
Topics covered
Dependable processes
Dependable programming
Fault tolerance
Fault-tolerant architectures
Page 5
Dependable Software Development
Programming techniques for building dependable software systems.
Page 6
Software Dependability
In general, software customers expect all software to be dependable
For non-critical applications, customers may be willing to accept some system failures
Some applications have very high dependability requirements
  Special programming techniques required
Page 7
Dependability Achievement
Fault avoidance
  Software developed so
    Human error avoided and
    System faults minimized
  Development process organized so
    Faults in software detected and
    Repaired before delivery to customer
Fault tolerance
  Software designed so
    Faults in delivered software do not result in system failure
Page 8
Diversity and Redundancy
Redundancy
  Keep more than one version of a critical component available so that if one fails, a backup is available.
Diversity
  Provide the same functionality in different ways so that they will not fail in the same way.
However, adding diversity and redundancy adds complexity and this can increase the chances of error.
Some engineers advocate simplicity and extensive verification & validation (V&V) as a more effective route to software dependability.
Page 9
Diversity and Redundancy Examples
Redundancy
  Where availability is critical (e.g. in e-commerce systems), companies normally keep backup servers and switch to these automatically if failure occurs.
Diversity
  To provide resilience against external attacks, different servers may be implemented using different operating systems (e.g. Windows and Linux).
Page 10
Fault Minimization
Current methods of software engineering now allow for production of fault-free software
Fault-free software means it conforms to its specification
Does NOT mean software which will always perform correctly
  Why not?
  Because of specification errors.
Page 11
Cost of Producing Fault-Free Software (1)
Very high
Cost-effective only in exceptional situations
  Which?
May be cheaper to accept software faults
But who will bear costs?
  Users?
  Manufacturers?
  Both?
Will the risk-sharing be with full knowledge?
Page 12
Cost of Producing Fault-Free Software (2)
The Pareto Principle
[Figure: costs plotted against total % of errors fixed, with 20%, 80% and 100% marked]
If curve really is asymptotic to 100%, cost may approach infinity.
Page 13
Cost of Producing Fault-Free Software (3)
[Figure: cost per error detected plotted against number of residual errors (many, few, very few)]
Just a different way of looking at it.
Page 14
Validation activities
Requirements inspections.
Requirements management.
Model checking.
Design and code inspection.
Static analysis.
Test planning and management.
Configuration management, discussed in Chapter 29, is also essential.
Page 15
Safe Programming
Faults in programs are usually a consequence of programmers making mistakes.
These mistakes occur because people lose track of the relationships among program variables.
Some programming constructs are more error-prone than others so avoiding their use reduces programmer mistakes.
Page 16
Fault-Free Software Development
Needs precise (preferably formal) specification
Requires organizational commitment to quality
Information hiding and encapsulation in software design essential
Use programming language with strict typing and run-time checking
Avoid error-prone constructs
Use dependable and repeatable development process
Page 17
Structured Programming
First discussed in 1970s
Programming without goto
While loops and if statements as only control statements
Top-down design
Important because it promoted thought and discussion about programming
Programs easier to read and understand than old spaghetti code
Page 18
Error-Prone Constructs (1)
Floating-point numbers
  Inherently imprecise – and machine-dependent
  Imprecision may lead to invalid comparisons
Pointers
  Pointers referring to wrong memory areas can corrupt data
  Aliasing can make programs difficult to understand and change
Dynamic memory allocation
  Run-time allocation can cause memory overflow
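
The point about invalid floating-point comparisons can be shown with a minimal sketch. Python is used only for illustration (the slides do not prescribe a language), and the tolerance value is an assumption chosen for the example, not a recommended setting.

```python
import math

# 0.1 and 0.2 have no exact binary representation, so their sum is slightly off.
total = 0.1 + 0.2

# Direct equality is the kind of invalid comparison the slide warns about.
exact_equal = (total == 0.3)

# Safer: compare within a tolerance (rel_tol is an assumed, illustrative value).
close_enough = math.isclose(total, 0.3, rel_tol=1e-9)
```

Here `exact_equal` is False while `close_enough` is True, even though both comparisons look equivalent on paper.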
Page 19
Error-Prone Constructs (2)
Parallelism
  Can result in subtle timing errors (race conditions) because of unforeseen interaction between parallel processes
Recursion
  Errors in recursion can cause memory overflow
Interrupts
  Interrupts can cause critical operation to be terminated and make program difficult to understand
  Similar to goto statements
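
One common way to use parallelism "with great care" is to protect shared state with a lock, so that a read-modify-write cannot interleave with another thread's update. The sketch below is only an illustration in Python; the `Counter` class and `run_workers` helper are hypothetical names, not part of the course material.

```python
import threading

class Counter:
    """Shared counter whose read-modify-write step is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads could read the same old value
        # and one of the updates would be lost (a race condition).
        with self._lock:
            self.value += 1

def run_workers(counter, n_threads=4, n_increments=10_000):
    """Start n_threads threads that each increment the counter n_increments times."""
    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place the final count always equals `n_threads * n_increments`; the cost is that threads wait for each other at the critical section.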
Page 20
Error-Prone Constructs (3)
Inheritance
  Code not localized
  Can result in unexpected behavior when changes made
  Can be hard to understand
  Difficult to debug problems
These constructs don’t have to be absolutely eliminated
  But must be used with great care
Page 21
Reliable Software Processes
Well-defined, repeatable software process:
  Reduces software faults
  Does not depend entirely on individual skills – can be enacted by different people
Process activities should include significant verification and validation
Page 22
Process Validation Activities
Requirements inspections
Requirements management
Model checking
Design and code inspection
Static analysis
Test planning and management
Configuration management also essential
Page 23
Fault Tolerance
Critical software systems must be fault tolerant
  System can continue operating in spite of software failure
Fault tolerance required where
  There are high availability requirements or
  System failure costs are very high
Even “fault-free” systems need fault tolerance
  May be specification errors or
  Validation may be incorrect
Page 24
Fault Tolerance Actions
Fault detection
  Incorrect system state has occurred
Damage assessment
  Identify parts of system state affected by fault
Fault recovery
  Return to known safe state
Fault repair
  Prevent recurrence of fault
  Identify underlying problem
  If not transient*, then fix errors of design, implementation, documentation or training that led to error
* E.g., hardware failure
Page 25
Approaches to Fault Tolerance
Defensive programming
  Programmers assume faults in code
  Check state after modifications to ensure consistency
Fault-tolerant architectures
  HW & SW system architectures support redundancy and fault tolerance
  Controller detects problems and supports fault recovery
Complementary rather than opposing techniques
Page 26
Fault Detection (1)
Strictly-typed languages
  E.g., Java and Ada
  Many errors trapped at compile-time
Some classes of error can only be discovered at run-time
Fault detection:
  Detecting erroneous system state
  Throwing exception
    To manage detected fault
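
A minimal sketch of run-time fault detection with an exception follows. It is written in Python for brevity rather than in Java or Ada, and the sensor limits, the `SensorRangeError` class and both function names are hypothetical, introduced only for illustration.

```python
class SensorRangeError(Exception):
    """Raised when a reading is outside the range the system can trust."""

# Hypothetical limits, chosen only for this example.
MIN_READING = 0.0
MAX_READING = 150.0

def checked_reading(raw):
    # The value is only known at run time, so no compile-time check can catch this.
    if not (MIN_READING <= raw <= MAX_READING):
        raise SensorRangeError(
            f"reading {raw} outside [{MIN_READING}, {MAX_READING}]"
        )
    return raw

def safe_reading(raw, fallback):
    # Manage the detected fault: substitute a known-good value.
    try:
        return checked_reading(raw)
    except SensorRangeError:
        return fallback
```

Raising a dedicated exception type keeps detection (in `checked_reading`) separate from the decision about how to manage the fault (in `safe_reading`).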
Page 27
Fault Detection (2)
Preventative fault detection
  Check conditions before making changes
  If bad state detected, don’t make change
Retrospective fault detection
  Check validity after system state has been changed
  Used when
    Incorrect sequence of correct actions leads to erroneous state or
    Preventative fault detection involves too much overhead
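
The difference between the two styles can be sketched on a toy bank account. This is an illustrative Python example under assumed rules (the balance must never go negative); the `Account` class and its method names are hypothetical.

```python
class Account:
    """Toy account; assumed invariant: balance is never negative."""

    def __init__(self, balance):
        self.balance = balance

    def withdraw_preventative(self, amount):
        # Check conditions BEFORE changing state; refuse the change if invalid.
        if amount <= 0 or amount > self.balance:
            return False
        self.balance -= amount
        return True

    def withdraw_retrospective(self, amount):
        # Change state first, then check validity of the result.
        previous = self.balance
        self.balance -= amount
        if self.balance < 0 or self.balance > previous:
            # Erroneous state detected after the fact: restore the old value.
            self.balance = previous
            return False
        return True
```

Both methods reject an overdraft, but the retrospective version has to remember the previous value so that it can undo the change once the bad state is detected.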
Page 28
Damage Assessment
Analyze system state
  Judge extent of corruption caused by system failure
Assess what parts of state space have been affected by failure
Generally based on ‘validity functions’
  Can be applied to state elements
  Assess whether their values are within allowed range
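
A validity function can be as simple as a range check applied to each state element. The sketch below is an assumption-laden illustration in Python: the state element names and their allowed ranges are invented for the example.

```python
# Hypothetical state elements and allowed ranges, for illustration only.
ALLOWED_RANGES = {
    "temperature_c": (-40.0, 125.0),
    "pressure_kpa": (0.0, 500.0),
    "valve_percent": (0.0, 100.0),
}

def in_range(name, value):
    """Validity function: is this state element within its allowed range?"""
    low, high = ALLOWED_RANGES[name]
    return low <= value <= high

def damaged_elements(state):
    """Return the names of state elements that fail their validity check."""
    return sorted(
        name for name, value in state.items() if not in_range(name, value)
    )
```

Applied to a state such as `{"temperature_c": 20.0, "pressure_kpa": -3.0, "valve_percent": 140.0}`, the function reports which parts of the state need recovery rather than just that something is wrong.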
Page 29
Damage Assessment Techniques
Checksums
  Used for damage assessment in data transmission
  Verify integrity after transmission
Redundant pointers
  Check integrity of data structures
  E.g., databases
Watch-dog timers
  Check for non-terminating processes
  If no response after certain time, there’s a problem
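
Of the three techniques, the checksum is the easiest to sketch. The example below uses CRC-32 from Python's standard `zlib` module as one possible checksum; the framing format (payload followed by a 4-byte checksum) and the function names are assumptions made for illustration.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Sender side: append a 4-byte CRC-32 checksum to the payload."""
    checksum = zlib.crc32(payload)
    return payload + checksum.to_bytes(4, "big")

def is_intact(message: bytes) -> bool:
    """Receiver side: recompute the checksum and compare with the one received."""
    if len(message) < 4:
        return False
    payload, received = message[:-4], message[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received
```

A checksum like this only detects damage; repairing the data needs extra redundancy or a retransmission.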
Page 30
Fault Recovery
Forward recovery
  Apply repairs to corrupted system state
  Domain knowledge required to compute possible state corrections
  Forward recovery usually application specific
Backward recovery
  Restore system state to known safe state
  Simpler than forward recovery
  Details of safe state are maintained and replace corrupted system state
Page 31
Forward Recovery
Data communications
  Add redundancy to coded data
  Use to repair data corrupted during transmission
Redundant pointers
  E.g., doubly-linked lists
  Damaged list / file may be repaired if enough links are still valid
  Often used for database and file system repair
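
The doubly-linked list case can be sketched as follows. This Python example assumes the simplest damage scenario, in which the backward (`prev`) links are all intact and only forward (`next`) links are corrupted; the `Node` class and helper names are hypothetical.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.prev = None

def build(values):
    """Build a doubly-linked list and return (head, tail)."""
    head = tail = None
    for v in values:
        node = Node(v)
        if tail is None:
            head = node
        else:
            tail.next = node
            node.prev = tail
        tail = node
    return head, tail

def repair_forward_links(tail):
    """Rebuild every `next` link by walking the (assumed intact) `prev` links."""
    if tail is None:
        return None
    node = tail
    node.next = None
    while node.prev is not None:
        node.prev.next = node  # restore the forward link from the redundant one
        node = node.prev
    return node  # the head of the repaired list

def to_list(head):
    """Collect values by following `next` links from the head."""
    out = []
    node = head
    while node is not None:
        out.append(node.value)
        node = node.next
    return out
```

The repair works only because each link is stored twice; if both the forward and backward link between two nodes are lost, this sketch cannot recover the list.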
Page 32
Backward Recovery
Transaction processing often uses conservative methods to avoid problems
  Complete computations, then apply changes
  Keep original data in buffers
Periodic checkpoints allow system to 'roll-back' to correct state
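
The ideas on this slide (compute first, then apply; keep the original data; roll back to a checkpoint) can be sketched in a few lines. The `CheckpointedStore` class below is a hypothetical in-memory illustration in Python, not a description of how any real transaction system is implemented.

```python
import copy

class CheckpointedStore:
    """In-memory key/value state with one checkpoint for backward recovery."""

    def __init__(self, data):
        self._data = dict(data)
        self._checkpoint = copy.deepcopy(self._data)

    def checkpoint(self):
        # Record the current state as the known safe state.
        self._checkpoint = copy.deepcopy(self._data)

    def rollback(self):
        # Backward recovery: replace the current state with the safe copy.
        self._data = copy.deepcopy(self._checkpoint)

    def apply(self, updates, is_valid):
        # Complete the computation on a working copy, then apply the changes
        # only if the result passes the caller-supplied validity check.
        working = copy.deepcopy(self._data)
        working.update(updates)
        if not is_valid(working):
            return False
        self._data = working
        return True

    def snapshot(self):
        return copy.deepcopy(self._data)
```

For example, after `store = CheckpointedStore({"a": 1})`, an update that fails `is_valid` leaves the state untouched, and `store.rollback()` undoes any committed updates made since the last checkpoint.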
Page 33
Recovery Blocks (1)
[Figure: recovery blocks. Try algorithm 1 and apply the acceptance test (test for success). If the acceptance test fails, retry with algorithm 2 and retest, then with algorithm 3 and retest.]
Continue execution if acceptance test succeeds
Signal exception if all algorithms fail
Page 34
Recovery Blocks (2)
Force different algorithm to be used for each version so they reduce probability of common errors
However, design of acceptance test difficult as it must be independent of computation used
Problems with approach for real-time systems because of sequential operation of redundant versions
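
The recovery-block scheme can be sketched as a loop over alternative algorithms with a shared acceptance test. The Python below is a minimal illustration; `recovery_block`, `AllAlternativesFailed` and the two sorting alternatives (one deliberately faulty) are hypothetical names invented for the example.

```python
from collections import Counter

class AllAlternativesFailed(Exception):
    """Signalled when no alternative passes the acceptance test."""

def recovery_block(alternatives, acceptance_test, *args):
    # Alternatives run sequentially, which is why the approach can be
    # a problem for real-time systems.
    for algorithm in alternatives:
        try:
            result = algorithm(*args)
        except Exception:
            continue  # treat a crash like a failed acceptance test
        if acceptance_test(result, *args):
            return result
    raise AllAlternativesFailed("no alternative passed the acceptance test")

def faulty_sort(items):
    # Deliberately wrong: returns the input order unchanged.
    return list(items)

def builtin_sort(items):
    return sorted(items)

def is_sorted_permutation(result, items):
    # Acceptance test independent of how the result was computed:
    # the output must be ordered and contain exactly the input elements.
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(result) == Counter(items)
```

Calling `recovery_block([faulty_sort, builtin_sort], is_sorted_permutation, [3, 1, 2])` rejects the first result and returns the second. The acceptance test checks properties of the answer rather than recomputing it, which is what keeps it independent of the algorithms under test.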
Page 35
Homework
Study Chapter 18 in detail using SQ3R
Required
  By next Wed 10 Nov 2004
  For 30 points
    20.1, 20.2, 20.4 – 20.6, 20.9 (@5) and pay attention to demands for examples
OPTIONAL
  By Wed 17 Nov 2004
  For up to 14 extra points, any or all of
    20.10 (@3), 20.11 (@3) – details please
    20.12 (@8) – detailed answers to all parts of this question
Page 36
DISCUSSION