FAILURE OR … RESILIENCE
DR JOHN ROOKSBY
IN THIS LECTURE…
This lecture
• Will introduce you to many of the themes I will cover on the course.
• Will characterise failure as the norm rather than the exception in systems operation.
• Will outline why critical systems engineering must address organisational and human factors as well as technical issues.
• Will build upon the idea of socio-technical systems engineering introduced in the last lecture, and will introduce the idea of resilience engineering.
A STORY
A professor has to give an important lecture. He wakes up late because his alarm clock fails to go off.
His wife has left the house already. Unfortunately she has left the kitchen tap running and it has flooded the floor.
The professor rushes to clean up the mess.
He gets to his car, only to realise he has locked his car keys and house keys inside the house.
He has left a spare house-key with a neighbour – but the neighbour is away.
He phones his wife but she doesn’t answer.
He calls a friend and asks for a lift, but the friend’s car is broken down.
The professor sets off for the bus, but then remembers there is a bus strike.
He calls a taxi, but the taxi company is overwhelmed because of the bus strike.
He gives up, calls work and cancels the lecture.
This story is adapted from Perrow C (1984) Normal Accidents: Living with High-Risk Technologies. Basic Books.
ABOUT FAILURE
Failure is a judgment
Failures are common
Failures often have multiple causes
Failures cascade
Some failures are more serious than others
Failures often have no ill effect
Failures can often be recovered from
Engineering cannot eliminate failure
Success is as complex as failure
FAILURE IS A JUDGMENT
What do we judge the exact failure to be?
• Failure to get to work? Failure to give lecture? The smaller failures that led to cancellation?
What do we judge to be a significant failure?
• Does cancelling a lecture matter?
• Can cancellation be corrected for?
Different perspectives can be taken on failure
• Different explanations often suit different purposes
• There may sometimes be no definite agreement about a failure, but this does not mean any interpretation will do.
Passport issuing 1998/9
Sources: Graph - The Passport Delays of Summer 1999. NAO Report. Images – BBC News
FAILURES ARE COMMON
Errors and failures happen all the time, particularly in complex systems where there is a lot to go wrong.
How many errors have you made in the last half an hour?
If each server in a data centre has 99.999% reliability, what is the probability that all of them are working at any one time:
a) if it has 10,000 servers?
b) if it has 100,000 servers?
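A quick way to work through this question is sketched below. It assumes each server is working with probability 0.99999 independently of the others; that independence is our simplifying assumption, since real servers share power, network and software and so tend to fail together.

```python
def prob_all_working(reliability: float, n_servers: int) -> float:
    """Probability that every one of n independent servers is working."""
    return reliability ** n_servers

for n in (10_000, 100_000):
    p = prob_all_working(0.99999, n)
    print(f"{n:>7} servers: P(all working) = {p:.3f}")
```

Under that assumption the answers come out at roughly 0.905 and 0.368: with 100,000 servers, at least one is more likely than not to be down at any given moment.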
http://www.time.com/time/photogallery/0,29307,2036928_2218548,00.html
FAILURES OFTEN HAVE MULTIPLE CAUSES
There were multiple (mainly mundane) causes behind the lecture cancellation:
• Human error (leaving tap running, forgetting keys)
• Practices and procedures (waking up late, rushing)
• Technical failure (alarm clock, car)
• System design (door allows you to be locked out)
• Environment (lives too far from work)
• External failures (bus strike, lack of taxi capability)
• Planning (relying on a single lecturer)
Who or what is responsible?
Who has responsibility?
http://gizmodo.com/5844628/a-passenger-airplane-nearly-flew-upside-down-because-of-a-dumb-pilot
FAILURES CASCADE
Complex systems have a high number of components and will be dependent on a high number of external factors. These interdependencies may not always be apparent.
Often the cause or causes of a failure are at some remove from the failure itself.
• A simplistic view is that there are chains of failure: a domino effect where one problem leads to another
• A more complex view is that failures have complex webs of causes and influences
• We may also view failures in terms of problems with defences
Disasters often result from unfortunate coincidences and combinations of failure.
SWISS CHEESE MODEL
Figure: layers labelled Hardware, Operation and Software
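One way to make the Swiss cheese picture concrete is to treat each layer as a defence that stops a hazard unless the hazard happens to hit a hole. The sketch below assumes the layers fail independently and uses made-up hole probabilities; both are illustrative assumptions, not figures from the lecture.

```python
import math

# Hypothetical chance that a hazard slips through each defensive layer.
hole_probability = {"hardware": 0.01, "software": 0.05, "operation": 0.10}

def prob_hazard_gets_through(holes: dict[str, float]) -> float:
    """A hazard causes an accident only if it passes through every layer."""
    return math.prod(holes.values())

print(f"All layers in place:      {prob_hazard_gets_through(hole_probability):.6f}")

# If one layer is effectively removed (for example, a check is routinely
# bypassed), its hole probability becomes 1 and the overall risk rises tenfold.
weakened = {**hole_probability, "operation": 1.0}
print(f"Operation layer bypassed: {prob_hazard_gets_through(weakened):.6f}")
```

Independence is the optimistic case. The unfortunate coincidences described above often arise because a single underlying problem opens holes in several layers at once.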
SOME FAILURES ARE MORE SERIOUS THAN OTHERS
It is often helpful to distinguish between faults, errors, failures, disasters and catastrophes, but there is no consistently used terminology.
Failure is a judgment
The seriousness of a failure depends on its context.
• Failure in a life-critical system vs in a word processor
• When is it acceptable for an aging component to fail?
• When is it acceptable to take risks (e.g. do maintenance)?
Engineers take different perspectives on failure. Some argue that all failures, no matter how small, should be taken seriously. Some argue we need systems to be “good enough”.
FAILURES OFTEN HAVE NO ILL EFFECT
An error or failure may happen many times with no ill effect.
• This can lead people to be complacent
• It may one day lead to disaster
For example, the Columbia shuttle disaster occurred after foam shed during launch damaged the thermal protection on the shuttle's left wing.
• Similar foam strikes had happened many times
• NASA couldn’t believe this strike would cause the loss of Columbia
FAILURES CAN OFTEN BE RECOVERED FROM
A disaster is rarely an instantaneous event. More often it results from an unfortunate combination of failures that unfold over a period of time.
• Failures can often be mitigated
• Failures can often be recovered from
A resilient system is one that is able to recover from failures. It is the opposite of a brittle system.
We must give operators the ability to mitigate and recover from failure.
Image from: ATSB TRANSPORT SAFETY REPORT Aviation Occurrence Investigation – AO-2010-089 Preliminary
ENGINEERING CANNOT ELIMINATE FAILURES
Good engineering can greatly reduce but never eliminate the possibility of failure.
• Testing can be used to find problems but never show their absence
• Formal methods can be used to eliminate design faults but this does not mean problems will not emerge in manufacturing or system operation
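The point about testing can be illustrated with a small sketch. The function and its tests are hypothetical examples of our own, chosen because the bug is easy to see once it is pointed out.

```python
def is_leap_year(year: int) -> bool:
    """Deliberately incomplete: ignores the century rules."""
    return year % 4 == 0

# These tests all pass...
assert is_leap_year(2024)
assert not is_leap_year(2023)
assert is_leap_year(2000)

# ...yet the function is wrong for 1900, which no test exercised.
print(is_leap_year(1900))  # True, although 1900 was not a leap year
```

A passing test suite shows only that the chosen inputs behave as expected. It says nothing about the inputs nobody thought to try.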
Critical systems engineering must focus on operation as well as design.
Systems are increasingly operated as services rather than products, so this risk increasingly falls on the developers (!)
SUCCESS IS AS COMPLEX AS FAILURE
We need to learn from success, not just failure
• But success is even harder to define than failure.
Success is a judgment
• One person’s success is another’s failure
• A successful system may just be one that hasn’t yet failed
Success can be studied in terms of
• Noteworthy success
• Ordinary operation
• “Successful failures”
SOCIO-TECHNICAL SYSTEMS ENGINEERING
Figure: system layers (Equipment; Operating Systems; Communications + Data Management; Applications; People and Processes; Organisations; Society), annotated with the scope of Software Engineering and of Socio-Technical Systems Engineering
RESILIENCE
Design for failure
• How can a system fail gracefully and appropriately?
Design for recovery
• How can a system be designed to support mitigation and recovery from failure?
Design for avoidance
• How can we reduce the number of failures a system will encounter?
For all of these we need to understand systems operation. Critical systems engineering is not just about the design process, but also about understanding operation.
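At the level of code, the first two questions can be sketched as follows. This is a minimal illustration only, with hypothetical names (`fetch_live_price`, `get_price` and the cached value are all invented for the example). It covers a narrow technical slice of resilience and says nothing about the operators and organisation around the system.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pricing")

_last_known_price = 100.0  # hypothetical value cached from an earlier success

def fetch_live_price() -> float:
    """Stand-in for a call to an unreliable external service."""
    raise TimeoutError("pricing service did not respond")

def get_price(fetch=fetch_live_price, retries: int = 2) -> tuple[float, bool]:
    """Return (price, is_fresh).

    Design for recovery: retry a few times before giving up.
    Design for failure: fall back to the last known value and flag it as
    stale, so an operator can see the system is degraded rather than healthy.
    """
    global _last_known_price
    for attempt in range(1, retries + 1):
        try:
            price = fetch()
            _last_known_price = price
            return price, True
        except (TimeoutError, ConnectionError) as exc:
            log.warning("attempt %d failed: %s", attempt, exc)
    return _last_known_price, False

price, fresh = get_price()
print(price, "fresh" if fresh else "stale")
```

Returning the stale flag alongside the value is the important design choice here: a fallback that hides the failure from operators takes away their chance to mitigate it.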
Microsoft “containerised” data centre
SUMMARY
1. Failure is the norm, not the exception
2. Resilient systems are able to cope with, recover from and avoid failure
3. Resilience is a socio-technical problem, not a purely technical one
HOMEWORK
First read
Chapter 3 “The Human Contribution” from J Reason (2008) The Human Contribution. Farnham, Ashgate.
Then
Make a note of any interesting slips, lapses, mistakes, violations, etc. that you have made recently