FAILURE DR JOHN ROOKSBY
Page 1: CS5032 Lecture 2: Failure

FAILURE

DR JOHN ROOKSBY

Page 2: CS5032 Lecture 2: Failure

OR …

Page 3: CS5032 Lecture 2: Failure

RESILIENCE

Page 4: CS5032 Lecture 2: Failure

IN THIS LECTURE…

This lecture

• Will introduce you to many of the themes I will cover on the course.

• Will characterise failure as the norm rather than the exception in systems operation.

• Will outline why critical systems engineering must address organisational and human factors as well as technical issues.

• Will build upon the idea of socio-technical systems engineering introduced in the last lecture, and will introduce the idea of resilience engineering.

Page 5: CS5032 Lecture 2: Failure

A STORY

A professor has to give an important lecture. He wakes up late because his alarm clock fails to go off.

His wife has left the house already. Unfortunately she has left the kitchen tap running and it has flooded the floor.

The professor rushes to clean up the mess.

He gets to his car only to realise he has locked his car and house keys inside.

He has left a spare house-key with a neighbour – but the neighbour is away.

He phones his wife but she doesn’t answer.

Page 6: CS5032 Lecture 2: Failure

A STORY

He calls a friend and asks for a lift, but the friend’s car is broken down.

The professor sets off for the bus, but then remembers there is a bus strike.

He calls a taxi, but the taxi company is overwhelmed because of the bus strike.

He gives up, calls work and cancels the lecture.

This story is adapted from Perrow C (1984) Normal Accidents: Living with High-Risk Technologies. Basic Books.

Page 7: CS5032 Lecture 2: Failure

ABOUT FAILURE

Failure is a judgment

Failures are common

Failures often have multiple causes

Failures cascade

Some failures are more serious than others

Failures often have no ill effect

Failures can often be recovered from

Engineering cannot eliminate failure

Success is as complex as failure

Page 8: CS5032 Lecture 2: Failure

FAILURE IS A JUDGMENT

What do we judge the exact failure to be?

• Failure to get to work? Failure to give lecture? The smaller failures that led to cancellation?

What do we judge to be a significant failure?

• Does cancelling a lecture matter?

• Can cancellation be corrected for?

Different perspectives can be taken on failure

• Different explanations often suit different purposes

• There may sometimes be no definite agreement about a failure, but this does not mean any interpretation will do.

Page 9: CS5032 Lecture 2: Failure

Passport issuing 1998/9

Sources: Graph - The Passport Delays of Summer 1999. NAO Report. Images – BBC News

Page 10: CS5032 Lecture 2: Failure

FAILURES ARE COMMON

Errors and failures happen all the time, particularly in complex systems where there is a lot to go wrong.

How many errors have you made in the last half an hour?

If servers in a data center have 99.999% reliability, what are the odds that all will be working at any one time:

a) if it has 10,000 servers?

b) if it has 100,000 servers?
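
A rough worked answer to the question above, as a minimal sketch. It assumes that "99.999% reliability" means each server is up 99.999% of the time and that servers fail independently of one another; the slide states neither, and correlated failures (shared power, shared network) would change the numbers. The function name is illustrative only.

```python
def prob_all_working(n_servers: int, reliability: float = 0.99999) -> float:
    """Probability that every one of n independent servers is up at the same moment."""
    return reliability ** n_servers


for n in (10_000, 100_000):
    print(f"{n:>7} servers: {prob_all_working(n):.3f}")
```

Under these assumptions the chance that everything is working is roughly 90% with 10,000 servers and roughly 37% with 100,000, so at scale some failure is the expected state.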

http://www.time.com/time/photogallery/0,29307,2036928_2218548,00.html

Page 11: CS5032 Lecture 2: Failure

FAILURES OFTEN HAVE MULTIPLE CAUSES

There were multiple (mainly mundane) causes behind the lecture cancellation:

• Human error (leaving tap running, forgetting keys)

• Practices and procedures (waking up late, rushing)

• Technical failure (alarm clock, car)

• System design (door allows you to be locked out)

• Environment (lives too far from work)

• External failures (bus strike, lack of taxi capability)

• Planning (relying on a single lecturer)

Who or what is responsible?

Who has responsibility?

Page 12: CS5032 Lecture 2: Failure

http://gizmodo.com/5844628/a-passenger-airplane-nearly-flew-upside-down-because-of-a-dumb-pilot

Page 13: CS5032 Lecture 2: Failure

FAILURES CASCADE

Complex systems have a high number of components and will be dependent on a high number of external factors. These interdependencies may not always be apparent.

Often the cause or causes of a failure are at one or more removes from the failure itself.

• A simplistic view is that there are chains of failure: a domino effect where one problem leads to another

• A more complex view is that failures have complex webs of causes and influences

• We may also view failures in terms of problems with defenses

Disasters often result from unfortunate coincidences and combinations of failure.

Page 14: CS5032 Lecture 2: Failure

SWISS CHEESE MODEL

Hardware

Operation

Software
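
The Swiss cheese model pictures each layer of defence as a slice with holes; harm occurs only when the holes in every slice line up. A minimal sketch of that idea follows. The per-layer probabilities are hypothetical values chosen for illustration, and the layers are assumed to fail independently, which real systems often violate.

```python
import math

# Hypothetical chance that each defensive layer fails to stop a given hazard.
HOLE_PROBABILITY = {"hardware": 0.01, "software": 0.05, "operation": 0.10}


def prob_hazard_gets_through(holes: dict[str, float]) -> float:
    """A hazard causes harm only if it finds a hole in every layer."""
    return math.prod(holes.values())


# Remove one layer of defence to see how much it was contributing.
without_operation = {k: v for k, v in HOLE_PROBABILITY.items() if k != "operation"}

print(prob_hazard_gets_through(HOLE_PROBABILITY))
print(prob_hazard_gets_through(without_operation))
```

With these made-up numbers, dropping the operational layer makes a breach ten times more likely, even though that layer is the leakiest of the three.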

Page 15: CS5032 Lecture 2: Failure

SOME FAILURES ARE MORE SERIOUS THAN OTHERS

It is often helpful to distinguish between faults, errors, failures, disasters and catastrophes. But there is no consistently used terminology.

Failure is a judgment

The seriousness of a failure is contextually dependent.

• Failure in a life-critical system vs in a word processor

• When is it acceptable for an aging component to fail?

• When is it acceptable to take risks (e.g. do maintenance)?

Engineers take different perspectives on failure. Some argue that all failures, no matter how small, should be taken seriously. Some argue we need systems to be “good enough”.

Page 16: CS5032 Lecture 2: Failure
Page 17: CS5032 Lecture 2: Failure

FAILURES OFTEN HAVE NO ILL EFFECT

An error or failure may happen many times with no ill effect.

• This can lead people to be complacent

• It may one day lead to disaster

For example, the Columbia shuttle disaster occurred when foam shed during launch damaged the thermal protection on the shuttle's left wing

• Similar foam strikes had happened many times

• NASA couldn’t believe this strike would cause the loss of Columbia
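
The arithmetic behind this kind of complacency can be sketched as follows. The per-event probability is a hypothetical figure for illustration, not an estimate for any real system, and events are assumed independent.

```python
def prob_at_least_one_disaster(p_per_event: float, n_events: int) -> float:
    """Chance that at least one of n independent events turns into a disaster."""
    return 1 - (1 - p_per_event) ** n_events


# Hypothetical: each occurrence of the error has a 1% chance of ending badly.
for n in (1, 10, 100):
    print(f"after {n:>3} occurrences: {prob_at_least_one_disaster(0.01, n):.3f}")
```

An error that is harmless 99 times in 100 looks safe on any single occasion, yet under these assumptions the chance of at least one bad outcome across 100 occurrences is about 63%.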

Page 18: CS5032 Lecture 2: Failure

FAILURES CAN OFTEN BE RECOVERED FROM

A disaster is rarely an instantaneous event. Often a disaster results from an unfortunate combination of failures and often these take place over a period of time.

• Failures can often be mitigated

• Failures can often be recovered from

A resilient system is one that is able to recover from failures. It is the opposite of a brittle system.

We must give operators the ability to mitigate and recover from failure.

Page 19: CS5032 Lecture 2: Failure

Image from: ATSB TRANSPORT SAFETY REPORT Aviation Occurrence Investigation – AO-2010-089 Preliminary

Page 20: CS5032 Lecture 2: Failure

ENGINEERING CANNOT ELIMINATE FAILURES

Good engineering can greatly reduce but never eliminate the possibility of failure.

• Testing can be used to find problems but never show their absence

• Formal methods can be used to eliminate design faults but this does not mean problems will not emerge in manufacturing or system operation
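
A small illustration of the point about testing, using a deliberately flawed function written for this example: every test passes, yet a fault remains for an input nobody tried.

```python
def is_leap_year(year: int) -> bool:
    """Deliberately flawed: ignores the century rules of the Gregorian calendar."""
    return year % 4 == 0


# These tests all pass...
assert is_leap_year(2024)
assert not is_leap_year(2023)
assert is_leap_year(2000)

# ...but the function is still wrong for an untested input.
print(is_leap_year(1900))  # True, although 1900 was not a leap year
```

Passing tests show only that the chosen cases work, not that no faulty case exists.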

Critical systems engineering must focus on operation as well as design.

Systems are increasingly operated as services rather than products, so this risk increasingly falls on the developers (!)

Page 21: CS5032 Lecture 2: Failure
Page 22: CS5032 Lecture 2: Failure

SUCCESS IS AS COMPLEX AS FAILURE

We need to learn from success, not just failure

• But success is even harder to define than failure.

Success is a judgment

• One person’s success is another’s failure

• A successful system may just be one that hasn’t yet failed

Success can be studied in terms of

• Noteworthy success

• Ordinary operation

• “Successful failures”

Page 23: CS5032 Lecture 2: Failure
Page 24: CS5032 Lecture 2: Failure
Page 25: CS5032 Lecture 2: Failure

SOCIO-TECHNICAL SYSTEMS ENGINEERING

Software Engineering

Organisations

People and Processes

Communications + Data Management

Operating Systems

Equipment

Applications

Socio-Technical Systems Engineering

Society

Page 26: CS5032 Lecture 2: Failure

RESILIENCE

Design for failure

• How can a system fail gracefully and appropriately?

Design for recovery

• How can a system be designed to support mitigation and recovery from failure?

Design for avoidance

• How can we reduce the number of failures a system will encounter?

For all of these we need to understand systems operation. Critical systems engineering is not just about the design process, but also about understanding operation.
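
In software terms, designing for failure and for recovery might look like the minimal sketch below: retry the normal operation, then fall back to a degraded but still useful result. All names here (call_with_recovery, flaky_reading, last_known_reading) and the cached value are hypothetical, invented for illustration. Design for avoidance is not covered by this sketch.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_recovery(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 2,
) -> T:
    """Try the primary operation, retry on failure, then degrade gracefully."""
    for attempt in range(1 + retries):
        try:
            return primary()  # normal operation
        except Exception as exc:  # a failure, but not yet a disaster
            print(f"attempt {attempt + 1} failed: {exc}")
    return fallback()  # degraded service rather than none


def flaky_reading() -> float:
    # Hypothetical component that always fails in this example.
    raise TimeoutError("sensor did not respond")


def last_known_reading() -> float:
    # Hypothetical cached value used when the sensor is unavailable.
    return 21.5


result = call_with_recovery(flaky_reading, last_known_reading)
print(result)
```

Whether a stale reading is an acceptable fallback depends on how the system is actually operated, which is why the design question cannot be answered from the code alone.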

Page 27: CS5032 Lecture 2: Failure

Microsoft “containerised” data centre

Page 28: CS5032 Lecture 2: Failure

SUMMARY

1. Failure is the norm, not the exception

2. Resilient systems are able to cope with, recover from and avoid failure

3. Resilience is a socio-technical problem, not just a technical one

Page 29: CS5032 Lecture 2: Failure

HOMEWORK

First read

Chapter 3 “The Human Contribution” from J Reason (2008) The Human Contribution. Farnham, Ashgate.

Then

Make a note of any interesting slips, lapses, mistakes, violations, etc. that you have made recently