CRANFIELD UNIVERSITY Alexander Tomczynski APPLICATION OF RESILIENCE ENGINEERING CONCEPTS TO THE MANAGEMENT OF AIRWORTHINESS DEFENCE ACADEMY - COLLEGE OF MANAGEMENT AND TECHNOLOGY Military Aerospace and Airworthiness MSc Academic Year: 2013 - 2014 Supervisor: Dr Simon Place March 2014
CRANFIELD UNIVERSITY
Alexander Tomczynski
APPLICATION OF RESILIENCE ENGINEERING CONCEPTS TO
THE MANAGEMENT OF AIRWORTHINESS
DEFENCE ACADEMY - COLLEGE OF MANAGEMENT AND
TECHNOLOGY
Military Aerospace and Airworthiness
MSc
Academic Year: 2013 - 2014
Supervisor: Dr Simon Place
March 2014
This thesis is submitted in partial fulfilment of the requirements for
Figure 2-1 Accident Analysis and Risk Assessment Methods
Figure 2-2 Three Tracks on the Evolution of Safety Theory
Figure 2-3 The ‘Cynefin’ Framework – Complexity and Risk Management
Figure 2-4 General Form of a Model of Socio-technical Control
Figure 2-5 The Four Cornerstones of Resilience
Figure 2-6 Conceptual Framework for Resilience Engineering
Figure 2-7 Framework for managing the impact organisation, technology and human factors have on safety management systems
Figure 2-8 FRAM Function
Figure 4-1 FRAM Model Visualisation Demonstrating Taxonomy
Figure 4-4 Instances of Functional Output Variability Recorded in Occurrence Reports 2012/13
Figure 4-5 Instances of Reported Functional Output Variability by Function Type
Figure 4-6 Total Instances of Functional Output Variability Recorded in Occurrence Reports 2012/13
Figure 5-2 A Function and Its Aspects
Figure 5-3 Screen Capture of Visualisation Tool with Functions Added
Figure 5-4 Screen Capture of Visualisation Tool with External Dependencies Added
Figure 5-5 5-6 Screen Capture of Visualisation Tool with all Functional Activities Shown
Figure 5-7 Activities and Dependencies Linked to Aspects of the ‘Train Maintenance Personnel’ Function
Figure 6-10 Instantiation of Rigging Pin Occurrence
Figure 7-1 Tornado Process for Emergent Airworthiness Issues
Figure 7-2 Current Theoretical Basis for Tornado Airworthiness Risk Management
Figure 7-3 Proposed Functional Resonance Risk Management Theory - Visualisation of a Generic Hazardous Process
Figure 7-4 FRAM Model Risk Assessment Process
Figure 7-5 Operation of Components Beyond Cleared Life - First Stage Risk Visualisation, Excluding Background Functions
Figure 7-6 Visualisation of Hazard Generation Process
Figure 7-7 Visualisation of Potential Accident Processes
Figure 7-8 Proposed Risk Management Process
Figure 8-1 Fractal Property of the FRAM - Function Decomposed into Lower Level Functions
Figure 8-2 TASM Development Pathway
LIST OF TABLES
Table 2-1 Herrera's Ages of Safety Theory
Table 2-2 Benefits and Criticisms of Probabilistic Risk Assessment
Table 2-3 Examples of Resilience Engineering in Practice
Table 3-1 D-ASOR Classifications included in Data
Table 4-1 Example FRAM frame for Fault Diagnosis
Table 4-2 Listing of TASM Functions
Table 4-3 Summary of Internal Variability
Table 4-4 Summary of External Variability
Table 4-5 Example TASM Recording of Step 2a-c for Function 67 - Engine Fleet Monitoring
Table 4-6 Elaborate Description of Output Variability
Large organisations exist to maintain, modify, provide resources, operate and
monitor aircraft fleets in order to keep them airworthy. The way in which this
multitude of functions is carried out has a variety of effects on the aircraft
system and the property of airworthiness. Those responsible for airworthiness
can only manage it indirectly by managing the functioning of the organisation.
This is achieved by means of tasking maintenance or setting policy, defining an
organisational structure (including contracting out elements), providing
resources and conducting quality assurance. So whilst making engineering
assessments and specifying what physical actions are to be carried out on an
aircraft system is critical, the management of airworthiness is a wider
endeavour.
1.5 The Research Aim
The aim of this thesis is:
To apply resilience engineering concepts by producing a system
model of an airworthiness management organisation in order to
provide a tool to improve management of airworthiness.
1.6 Objectives
In order to achieve the aim the following research objectives were established:
Review the theoretical background to safety management and the
implications for airworthiness management.
Review the concepts of Resilience Engineering with an emphasis on
applying it to airworthiness management.
Establish a theoretical framework for a model of an airworthiness
management system.
Gather and use primary research data to establish and validate a model
of the airworthiness management system for the RAF Tornado Force.
Using the model, develop a tool to enhance the airworthiness
management system of the RAF Tornado Force.
1.7 Methodology Overview
A literature review of resilience engineering was carried out, which branched out
into source disciplines of systems thinking and engineering; control theory; non-
linear dynamics and complexity theory. A search for work in this area
addressing airworthiness or technical safety in other domains was conducted.
For the Tornado case study, the safety, airworthiness and assurance plans of
the various elements of the organisation were examined. Resilience
engineering provides a number of modelling techniques that could be applied to
the case study; these were assessed and down selected to the Functional
Resonance Analysis Method (FRAM). The system was assessed by semi-
structured interviews with key personnel as well as using a large amount of
information and experience gained from working within the system. The FRAM
Model was built within a spreadsheet and a separate model visualisation tool
was created using Microsoft Visio. This allowed for the identification of various
potential leading indicators for system safety. In order to validate the FRAM
model, specific case studies were required. Two incident reports and an
emergent airworthiness risk were selected for analysis.
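As a rough illustration of what a spreadsheet-based FRAM model records, the sketch below represents each function by the six aspects used in the FRAM (input, output, precondition, resource, control and time) and derives a coupling wherever one function's output is named in an aspect of another function. It is a minimal sketch under stated assumptions and not the model built for this thesis: the class and helper names are hypothetical, the two function names are borrowed from later chapters, and the items that link them are invented for illustration.

```python
from dataclasses import dataclass, field

# The six aspects that describe a FRAM function.
ASPECTS = ("input", "output", "precondition", "resource", "control", "time")


@dataclass
class FramFunction:
    """Hypothetical record for one function: a name plus a list of items per aspect."""
    name: str
    aspects: dict = field(default_factory=lambda: {a: [] for a in ASPECTS})


def couplings(functions):
    """Return (upstream, item, downstream, aspect) for every output that
    another function names in one of its non-output aspects."""
    links = []
    for upstream in functions:
        for item in upstream.aspects["output"]:
            for downstream in functions:
                if downstream is upstream:
                    continue
                for aspect in ASPECTS:
                    if aspect != "output" and item in downstream.aspects[aspect]:
                        links.append((upstream.name, item, downstream.name, aspect))
    return links


# Illustrative content only: the items below are assumptions, not thesis data.
train = FramFunction("Train Maintenance Personnel")
train.aspects["output"].append("competent personnel")

diagnose = FramFunction("Fault Diagnosis")
diagnose.aspects["resource"].append("competent personnel")
diagnose.aspects["output"].append("diagnosed fault")

links = couplings([train, diagnose])
for link in links:
    print(link)
```

Matching on item names mirrors the way a spreadsheet lookup would join rows; a fuller model would also need to record the variability of each output, which this sketch omits.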
1.8 Descriptions and Definitions
For simplicity the standard terminology as described within MAA02 – Military
Aviation Authority Master Glossary (MAA, 2012) is adopted for this thesis.
There are a number of minor differences in emphasis between terms used here
and in civil aviation or other domains; these are discussed where relevant.
1.9 Thesis Structure
This thesis is structured around the research objectives:
Chapter 2 describes the theoretical foundations for resilience engineering
in the context of the other theories of safety and safety engineering
practice in other domains. Potentially useful models are analysed.
Chapter 3 details the methodology for carrying out the primary research.
Chapter 4 describes the process for building the case study FRAM Model
– the Tornado Airworthiness System Model.
Chapter 5 describes the development of the FRAM visualisation tool.
Chapter 6 discusses how the FRAM Model may be used for incident
analysis with reference to two examples.
Chapter 7 gives a process for, and an example of, using the FRAM Model as a risk
assessment tool.
Chapter 8 provides a general discussion of the case study exercise,
focussing on the applicability of Resilience Engineering to aspects of
airworthiness practice.
Chapter 9 provides some conclusions.
2 LITERATURE REVIEW
The literature review will examine arguments for broadening the scope of
airworthiness to address the complexities of managing modern aircraft,
maintenance and support organisations. Existing notions of cause, failure and
hazards are challenged as the theoretical background to resilience engineering
is described. Models and methods for understanding and managing the safety
and airworthiness of complex systems are examined using the paradigm of
resilience engineering.
2.1 Airworthiness in the Context of Safety
There are a number of definitions for the term airworthiness; all have at
their core the need for the aircraft to be able to be operated in safety or, as the
MAA has it, ‘without significant hazard’. Hazard is further defined as ‘an
intermediate state where the potential for harm exists’ (MAA, 2012b). The
hazard is said to lie between a cause (such as a technical or human failure) and
an accident. So whilst airworthiness is clearly a target for aerospace design
organisations to meet through satisfaction of certification standards, it is also an
element of system safety that requires management throughout the lifecycle of
the system. It is analogous to ‘technical safety’ in other domains, which is
often separated from ‘operational’ or ‘occupational’ safety.
2.1.1 Accident Investigations
The need to investigate loss of life or near misses is both a pragmatic and moral
choice. The conclusions drawn from such investigations are extremely
important at a human level but also critical to restoring system safety. It is
therefore vital for accident investigators to use mental and procedural models
that reflect the complexity of modern technologies. One of the largest accident
investigation agencies, the National Transportation Safety Board (NTSB),
determines a ‘probable cause’ in all its reports (Johnson and Holloway, 2004),
but ICAO recommends that ‘causes’ (plural) are determined (ICAO, 2001).
This indicates a governing accident chain theory in the former organisation but
perhaps a slightly more sophisticated model in the latter. Various writers (De
Landre et al., 2006; Coury et al., 2008) have proposed models or frameworks
in which multiple causes can be described in accident investigation. Much has
been written about the intersection between legal frameworks and accident
investigation methodologies. Dekker (2003) for example has described the
detrimental effect of the adversarial nature of justice. The rest of this chapter will
describe how assigning ‘root’ or probable cause to accidents is potentially
unhelpful in the context of complex systems. It follows therefore that notions of
blame or individual responsibility are often problematic to apply.
2.1.2 Initial and Type Airworthiness
Much of the airworthiness of a system is ‘designed-in’ before manufacture. This
involves specifications, systems configuration and assumptions on support and
maintenance philosophy. A structured systems engineering approach to safety
as described in ARP 4761 (SAE, 1996) is used to convince regulators that a
type certificate can be issued. The evolution of safety requirements and
regulation over a system’s lifecycle causes difficulty (Kelly and McDermid,
1999). Military aircraft in particular are often retained in service for many
decades. Whilst the technology may remain relatively constant, experience
shows that it is usual for operational usage to evolve over the course of the
lifecycle. For this reason it is important to regularly adjust, validate and reassess
airworthiness assessments if the type airworthiness of a design is to be
maintained.
2.1.3 Safety Management
For many complex systems, the development of safety cases is a mandatory
requirement (MoD, 2007) and in particular for military airworthiness this is
governed by MAA Regulatory Article 1205 (MAA, 2013). The concept of a
safety case is the presentation or collation of a body of evidence to assure
interested parties that the system is safe. This body of evidence is collected and
organised according to mental or procedural models. The theoretical basis for
these models is the same set of safety theories described below. Safety
management systems are similarly structured according to the prevailing
theoretical approach to safety. An evolution in modelling requires an evolved
approach to safety management.
2.1.4 Continuing Airworthiness
Continuing airworthiness relates to the maintenance of a particular, safe system
state for each of the individual aircraft being managed (MAA, 2012b). Given that
it is never possible to comprehensively inspect/audit each aircraft before every
flight, there must be assumptions made as to the effect of organisational and
human interactions with the aircraft so as to maintain the system in a safe state.
Understanding maintenance system performance is critical to assuring
continued airworthiness. This is achieved through a Continuing Airworthiness
Management Organisation (CAMO) which provides assurance that its specified
tasks are being undertaken successfully. This is primarily achieved through a
quality assurance system, which ensures that rigorous processes are
established (Casey, 2013).
2.2 A History of Safety Theory
Chapter One sketched out a chronological view of ‘Ages’ of safety theory. New
theories tend to gain traction as a result of the investigation of major accidents.
Herrera (2012) describes how safety theory has evolved across technological,
human factors, organisational and complexity ‘ages’, identifying key accidents
and ideas on a time line, which is summarised in Table 2-1:
Table 2-1 Herrera's Ages of Safety Theory
Time | Accidents | Theories and methods across the Technology, Human Factors, Organisational and Complexity "Ages" of Safety Theory
1930s | | Domino Model
1940 - 50s | | Failure Mode Effects Analysis (FMEA); Human Factors Design; Task Analysis
1960s | Aberfan Colliery Disaster | Fault Tree Analysis (FTA) - Minute-Man Missiles & Boeing aircraft; Energy Barrier Model; Technique for Human Error Rate Prediction
1970s | Flixborough & Seveso Chemical Plants; Tenerife Aircraft Collision; Three Mile Island Nuclear Plant | Probabilistic Risk Assessment (WASH-1400 Reactor Safety Study); Hazard & Operability Analysis; Energy Damage and Countermeasure Strategies; Man Made Disaster; Information Perspective
1980s | Bhopal Chemical Plant; Challenger Space Shuttle; Chernobyl Nuclear Plant; Kings Cross Railway; Piper Alpha Oil & Gas; Dryden Aviation | Crew Resource Management; Safety Culture; Swiss Cheese Model; Normal Accident Theory
1990s | Warsaw Air Crash; Iraq Friendly Fire; Cali Air Crash; Ariane 5 - Space; Norne Air Crash; Longford Oil & Gas | Mandatory Safety Cases (UK); Normal Deviations; Man, Technology and Organisation Concept; Drift into Failure; Risk Influence Model; High Reliability Organisations
2000s | Uberlingen Air Crash; Columbia Space Shuttle; Helios Airways; Texas City Refinery; Nimrod Air Crash; Air France 447; Deepwater Horizon | Human Factors Analysis & Classification System; Failure of Leadership, Culture & Priorities; Aviation Safety Management Systems; Resilience Engineering; Theory of Practical Drift
Leonhardt et al (2009) present a breakdown of safety methodologies within a
Resilience Engineering White Paper. This document describes Technical,
Human Factors, Organisational and Systemic accident analysis and risk
assessment methods. Systemic models/methods are those that have recently
emerged to provide a means of analysing safety from a ‘complexity’ standpoint.
These are shown chronologically in Figure 2-1 with an expansion of each
abbreviation available within the glossary.
Figure 2-1 Accident Analysis and Risk Assessment Methods (Leonhardt et al,
2009)
Saleh et al (2010) present a slightly different narrative in the development of
safety theory. Whilst they note most of the same key ideas and developments,
they identify three tracks in safety theory leading towards the modern ‘system
and control theoretic’. These are illustrated below:
Figure 2-2 Three Tracks on the Evolution of Safety Theory (Saleh et al., 2010)
The tracks are not exhaustive and there is some cross coupling between ideas.
Herrera’s (2012) technological age can be likened to the middle track, the
defence in depth track is comparable to the organisational age whilst the top
track has many human factors elements but takes much from the current ‘age of
complexity’. The current state of the art is given as a systems engineering-control
theory approach. Saleh et al (2010) acknowledge that the literature in the
field is particularly fractured. This is perhaps because the various theories
emanate from disparate fields such as psychology, reliability, operations studies
and management.
2.2.1 Technological Age – Governing Philosophy
The predominant theme in the technological age of safety theory is that of a
‘chain of causation’; first visualised as a set of toppling dominos by Heinrich
(1950). Each domino represented a factor in the accident: Management
controls; failure of a man; unsafe acts or mechanical conditions; the accident;
injury. Once the first domino was toppled, removal of any of the others would
prevent the final injury domino toppling. Related to this is the concept of an
accident or event chain, where causative elements or events link together to
form a chain, which if it had been broken would have prevented the accident. It
is unclear where this idea originated; it is perhaps a reflection that a linear view
of the world still represents the defining popular narrative for any major
accident. Leveson (2011) links this to an erroneous assumption that there is
always a cause for any given accident.
2.2.2 Technological Age – Tools
The notion of a linear event chain gave rise to methods of analysing system
safety or the related property of reliability. The Fault Tree Analysis (FTA)
methodologies were developed from reliability studies of the American
Minuteman missile system and quickly developed into a methodology for
analysing safety by defining the probability of an unsafe condition developing
(Herrera, 2012). Closely associated are event trees, which define hierarchies of
events following a single initiating event (such as an unsafe condition). These
analyses use stochastic methods to forecast top level probabilities for accidents
caused by single or multiple failures lower down in the system. There is always
a mathematical audit trail from the top level system safety target, for example
hull loss probability in commercial aviation, down to individual system or
component reliability data or predictions. Importantly, modern system safety
assessments contain more qualitative information based on expert
understanding of systems; carried out through Functional Hazard Assessments
(FHAs) (Dalton, 1996). When analysing accidents using event chain type
models such as FTA, there is a question of how far back it is appropriate to go
in order to find an initiating event. Leveson (2011) argues that selection of
initiating events is often arbitrary in accident analysis. It has been accepted in a
large number of major accident reports that management commitment to safety
or ‘safety culture’ is a key factor in risk of accident (Dekker, 2005), yet there is
no clear way in which these vital considerations can be fitted into an event chain
model. Reason (1997) espouses a version of the event chain in the famous
‘Swiss cheese’ model of organisational accidents. Reason’s cheese has
become the de-facto mental model for understanding safety and accidents
within the military aviation community, as articles in the RAF’s in-house safety
magazine Air Clues demonstrate (Anon, 2011; Gale et al., 2013).
Whilst Haddon-Cave’s (2009) investigation into Nimrod addresses issues of
culture and complexity, his view of causation is essentially linear. Leveson
(2011) outlines why linear accident models of the technological age such as the
Swiss Cheese model are no longer considered acceptable:
Direct Causality – there is a reliance on the notion that there is always a
linear relationship in which event A causes event B.
Subjectivity in Selecting Events – The backward chain of events is
often shown to stop for a number of arbitrary reasons, which could
include familiarity with a particular event in the sequence (“We’ve seen
this before”), deviation from a standard (component operates outside its
specification) or a lack of information (such as inability to understand a
human performance issue).
Subjectivity in Selecting Chaining Conditions – It is often not clear
which factors caused each other.
Discounting System Factors – Event chain models generally deal with
proximate causes and do not deal with issues such as culture or
organisational pressures which can pervade through a socio-technical
system.
A useful example of how this approach to accident analysis can prove
disastrous is given by Leveson (2011). She notes how an incident in which a DC-10
lost a cargo door (without loss of life) was attributed to the failure of a
baggage handler to close the door properly rather than to a design flaw. As a
result, two years later a similar incident resulted in the complete loss of a DC-10
near Paris in 1974.
2.2.3 Limits of Probabilistic Risk Assessment
Both civil and military airworthiness certification standards require certain safety
targets to be met. These targets are expressed in terms of probabilities,
principally probability of hull loss and death of passengers or crew; for military
aircraft this is specified in Regulatory Article 1230 – Design Safety Targets
(MAA, 2012a). There are various other targets regarding risk of harm to third
parties or other unsafe conditions – these are operating risks. Operating risks
are also commonly assigned qualitative risk levels; in the case of military
aviation this process is specified in Regulatory Article 1210 – Management of
Operating Risk to Life (MAA, 2012a). This regulation advises Platform
Operators and Project Teams to make use of Fault Tree Analysis to enable
calculation of these risks. For some UK military platforms this has resulted in
the introduction of ‘Loss Models’ to guide the assessment of new or emergent
risks. In the case of Tornado, the Loss Model (Sugden, 2011) is not a tool that
can be used in isolation for predictive risk assessment; rather it uses incident
statistics to provide a current picture of loss rates across the fleet (Woodbridge,
2012). Regulation and recommended practice (SAE, 2010; Lloyd and Tye,
1982) for both civil and military airworthiness and safety targets call for the use of
fault tree and dependency diagram models. These methods of probabilistic risk
assessment (PRA) are linear, which usefully provides for aggregation of total
risk. There are however a variety of issues to consider in their use. Apostolakis
(2004) provides a summary of some of the benefits and criticisms of PRA.
However, in the case of airworthiness certification risk assessments the process
is generally based on a qualitative assessment of Functional Hazard Analysis
(FHA). FHA allows expert subjective analysis to provide an element of linkage
between various hazards. Equally Common Cause Analysis (CCA)
methodologies go some way to accounting for system-wide failure mechanisms.
The literature on resilience engineering disputes Apostolakis’ (2004) claim that
PRA deals effectively with true complexity.
Table 2-2 Benefits and Criticisms of Probabilistic Risk Assessment (Apostolakis,
2004)
Benefits | Criticisms
Multiple failures considered. | Human actions during accident scenarios cannot be modelled.
Increases likelihood of spotting complex failure interactions. | Difficulty of quantifying software failures.
Facilitates communication. | Cannot model safety culture.
Integrated Approach. | Difficulty estimating design and manufacturing errors.
Identifies unknown areas for research. |
Focuses risk management activity on key areas. |
PRA models are essentially a product of the ‘technical era’ of safety science;
they assume linear behaviour and that the systems being analysed are
tractable, and thus decomposable into independent subsystems. This remains the
de-facto approach to managing most complex socio-technical systems and
forms the basis of the safety case approach prevalent within many regulatory
environments. The fundamental assumptions that justify their use are
questionable when applied to complex socio-technical systems. The principal
concern is that the human element cannot be satisfactorily modelled using
Boolean logic. In systems where there are frequent interactions with humans,
whether operators, maintainers or design or support engineers, this presents the
possibility that common cause failures will be built into the system and that the
relationships will be non-linear.
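The sensitivity of a fault tree result to the independence assumption can be shown with a small calculation. The sketch below evaluates a two-channel redundant system first with independent failures and then with a fraction of each channel's failure probability attributed to a shared cause. It is illustrative only: the failure probability and the shared-cause fraction are hypothetical values, and the beta-factor treatment of common cause failure is a standard reliability technique introduced here for illustration rather than a method taken from the sources cited above.

```python
def and_gate(*probabilities):
    """Probability that all input events occur, assuming independence."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


def or_gate(*probabilities):
    """Probability that at least one input event occurs, assuming independence."""
    none_occur = 1.0
    for p in probabilities:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


# Hypothetical per-flight failure probability of each redundant channel.
p_channel = 1e-3

# Top event under full independence: both channels fail.
independent_top = and_gate(p_channel, p_channel)

# Beta-factor assumption: a fraction beta of each channel's failure
# probability comes from one shared cause that defeats both channels.
beta = 0.1
p_common = beta * p_channel
p_own = (1.0 - beta) * p_channel
with_common_cause = or_gate(p_common, and_gate(p_own, p_own))

print(f"independent channels: {independent_top:.3e}")
print(f"with common cause:    {with_common_cause:.3e}")
print(f"ratio:                {with_common_cause / independent_top:.1f}")
```

Under these assumed values the shared cause dominates the result, which is why the decomposition into independent subsystems matters so much to the top-level figure. The beta-factor correction is itself still a linear, Boolean treatment, so it does not address the non-linear human interactions discussed above.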
2.2.4 Human Factors
Herrera (2012) outlines how 20th century disasters such as Three Mile Island
and Flixborough showed that the event chain models were becoming
inadequate – the focus began to shift to human failing, with the human identified
as the number one unreliable component in the event chain. Herrera (2012)
highlights two trends in the age of human factors; studies concerned with
eliminating human error by design for human performance and studies into how
humans cope with disturbances.
2.2.5 Organisational
‘Man Made Disaster’ theory was the initiating scholarly theory behind
organisational accident theory (Saleh et al., 2010). This theory noted that within
a certain class of events known as ‘man made disasters’ there were multiple
event chains that reached a long way back into the past and that management and
organisation were key factors in causing accidents. Saleh et al (2010) also note
‘Normal Accident Theory’ and ‘High Reliability Organisations’ as key precepts of
the organizational accident. Normal accident theory notes that there are tight
couplings between interacting causal factors in complex system accidents and
that they cannot be predicted. This has been condemned as a somewhat
fatalistic view. Herrera (2012) sees High Reliability Organisation Theory as a
counter to Normal Accident Theory. This characterises successful organisations
as those operating complex systems with a very small number of accidents.
Saleh et al (2010) note that the research highlights a number of common
characteristics of such organisations such as:
Preoccupation with failure and organizational learning.
Commitment to and consensus on production and safety as concomitant
organizational goals.
Organizational slack and redundancy.
These facets of successfully safe or high reliability organisations correspond to
aspects of ‘safety culture’ as described by Reason (1997) and others.
2.3 Complexity
Aircraft are complicated machines; they have many components interacting in a
multitude of combinations. Dekker (2011) holds that analytic reduction, as
practised within traditional linear safety analysis, is unable to describe how
system elements and processes behave when exposed to multiple
simultaneous influences. He also describes the key distinction between a
complicated system such as an aircraft, which could conceivably be
disassembled then reassembled by a single person, and a complex system. A
complex system is one where the boundaries are ‘fuzzy’ (difficult to define
precisely) and the structure is intractable; an aircraft operated subject to human
factors, culture, regulatory and organisational factors is therefore complex.
Cilliers (2005) defines complex systems as those having the following
properties:
Large numbers of simple elements.
Dynamic, propagating and non-linear interactions; these define
behaviour which is emergent and cannot be understood by inspection of
components nor predicted by deterministic methods.
Open, exchanging energy and information with the environment.
Memory is distributed within the system, influencing behaviour.
Adaptive behaviour; without the intervention of external agents.
This study assumes that the complete aircraft system, incorporating its
operation and support, is complex rather than simply complicated. It could also
be argued that the addition of extensive software within aircraft renders the
system complex. The safety management system and airworthiness
management in particular must deal with complexity.
For those charged with managing the safety of complex systems, understanding
models for accidents and studying post mortem analyses of accidents does not
present a comprehensive approach to prevention. It is generally accepted that
events, hazards and risks often combine in unexpected ways. Is it therefore
adequate to manage safety risk as a game of ‘whack-a-mole’; eliminating or
mitigating risks as and when they become apparent (Zarboutis and Wright,
2006)?
It may be argued that a proactive reporting culture does much to allow
elimination or mitigation of risks before they materialise. Heinrich’s (1950)
‘iceberg’ model drives much of this effort to uncover previously unknown risk and
there is an indisputable logic which says that knowing about a risk is a first step
to eliminating or managing it. The continued history of complex accidents tells
us that this approach may never be completely effective in preventing
unexpected failure (Hollnagel, 2007). Leveson (2011) explains that the concept
of a High Reliability Organisation confuses notions of safety and reliability. Just
because individual components of a socio-technical system can be proven to be
individually reliable it does not follow that safety will necessarily emerge as a
system property. Systems may be reliable yet unsafe, such as the NASA Mars
lander which crashed because the designer failed to anticipate the interaction
between the software and mechanical systems. Equally it is possible for a
system to be unreliable yet safe where systems fail-safe.
2.3.1 Complexity Theory
Accident investigation or analysis of complex system failure requires a mental
model to be applied to the accident scenario (Hollnagel, 2011). Similarly
accident prevention through risk management uses modelling to understand
potential accidents. Hitchens (2003) describes how complexity is relative to the
observer’s frame of reference. Modelling complex systems requires judgement
as to the extent of elaboration or its converse; encapsulation. He proposes that
systems derive their degree of complexity from their variety, connectedness and
disorder. Socio-technical systems are increasing in complexity as a result of the
increased use of networks. Manson (2001) provides a useful review of
complexity theory, most of the branches of which have an antecedent in general
systems theory. Three main branches of complexity theory are identified.
‘Algorithmic complexity’ holds that complexity is defined by the difficulty in
describing system characteristics. ‘Deterministic complexity’ deals with chaos or
catastrophe theories, which posit that stable complex systems may become
suddenly unstable. ‘Aggregate complexity’ deals with how elements interact to
produce complexity. A key property of complex systems is that of emergence,
which describes how system-wide characteristics cannot be computed by
aggregating system component behaviour. Zarboutis (2006) highlights
patterns that emerge from complex socio-technical systems and erode their
resilience. Grøtan et al (2011) give a good account of the
theoretical foundations of complexity and how they can be applied to risk
assessment; the ‘Cynefin’ Framework provides a summary.
Figure 2-3 The ‘Cynefin’ Framework – Complexity and Risk Management (Grøtan
et al., 2011)
Generally the literature shows that whilst linear thinking has reached its limits
within system safety science, complexity theory has yet to be completely
applied to the problem. Zarboutis (2006) identifies how complexity theories can
be used to replace HAZOPS-type safety analyses. The key inputs should be:
How can system entities co-adapt?
What will the probable effect be on the whole?
How can such patterns be eliminated?
The output of such an analysis should therefore be some means of avoiding the
emergent harmful properties. Dekker (2011) advises that complexity theories
can be applied to accident investigation if the search for a single cause is
dropped and multiple narratives are allowed to overlap and on occasion
contradict each other. The nature of complexity defies analysis; Cilliers (2005)
writes on the ‘incompressibility’ of complex systems, in that the only reliable
model of a complex system is that which has the same level of detail as the
system itself. Clearly this is impractical, yet as any model will involve
simplification, disregarded elements may have non-linear effects and the
magnitude of the potential outcomes may be non-trivial. However Cilliers (2005)
also states that whilst modelling and computing complex systems will never be
sufficient, it is still necessary.
2.3.2 Systems Thinking and Systems Engineering
The concept of a system is well established, with roots in philosophy and
thermodynamics that led to the theory and practice of systems
engineering. Hitchens (2003) provides one definition:
A system is an open set of complementary, interacting parts with properties,
capabilities and behaviours emerging both from the parts and their interactions.
The concept of emergence is an important one; accidents are emergent system
states of disorder. Systems engineering involves the generation of models to
represent a system (Oliver et al., 1997). Leveson (2011) first describes how
safety ought to fit into systems engineering’s primary activities – Needs
Analysis, Feasibility studies, Trade studies, System architecture development
and Interface analysis. This is the basis for system safety assessments
employed in generating evidence for airworthiness certification as per ARP
4761 (Dalton, 1996). Saleh (2010) distinguishes between failure modes
attributable to component failure and those failures attributable to emergent or
interactive failures; his thesis is that a systems theoretic approach addresses
this second set of failures. However he raises concerns that formal systems
theoretic approaches such as co-ordinatability and consistency in hierarchical
and multilevel systems are yet to be fully applied to safety analysis. Leveson’s
(2011) Systems-Theoretic Accident Model and Processes (STAMP) uses
control theory and processes as the key to prevention of accidents. It
decomposes the system across the complete lifecycle, from concept to
disposal, into a series of control loops. The key to prevention of accidents is
said to be keeping the entire system in a state of equilibrium, which is achieved
by applying constraints to implement control. The model is said to more
effectively deal with software than traditional notions of failure. STAMP utilises
descriptions of control loops at technological subsystem level, human controller
level and socio-technical organisation level, shown in Figure 2-4. STAMP uses
a taxonomy of control loop failure modes as an audit check list. Salmon et al
(2012) compare STAMP to other models, concluding that STAMP provides a
more comprehensive system description but it is difficult to incorporate human
failures into the model, which itself needs a highly developed understanding of
the whole system. This highlights the difficulty in applying theoretically strong
models of complexity to particular scenarios.
Figure 2-4 General Form of a Model of Socio-technical Control (Leveson,
2011)
2.3.3 Control Theory
STAMP (Leveson, 2011) suggests that safety can be treated as a control
engineering problem and Saleh (2010) identifies this idea as an important
corollary to the development of a systems thinking approach to safety.
Kontogiannis and Malakis (2012a) describe how the concept of a model with
control loops is fundamental to systems safety incorporating human and
organisational factors. Hollnagel and Woods (2005) produced an Extended
COntrol Model (ECOM) which describes generically how organisational
processes transfer downwards to interact directly with and control the technological
system and hence alter its state. The Viable System Model (VSM) uses
cybernetic principles to describe how safety goals are transferred downwards
through an organisation and how output is controlled by various measures such
as audit (Espejo, 1989). Kontogiannis and Malakis (2012a) combine these two
models and apply them to the crash of flight AEW-241 in
December 1997. As with many control and systems models in the safety literature,
Kontogiannis and Malakis (2012a) highlight the difficulty of applying the models
for the purposes of accident prevention. Kontogiannis (2012b) also tries to apply these
principles in a case study involving emergency helicopter operations.
2.3.4 Non-Linear Dynamics
Control of complex socio-technical systems needs to address the problem of
non-linear behaviour. Bendat (1998) describes how physical and engineering
systems can be divided into linear and non-linear systems. A system $H$ is linear
if, for any inputs $x_1 = x_1(t)$ and $x_2 = x_2(t)$ and for any constants $c_1$ and $c_2$,
Equation 1 – Linear System (Bendat, 1998)
$$H[c_1 x_1 + c_2 x_2] = c_1 H[x_1] + c_2 H[x_2]$$
This leads to 2 properties:
Equation 2 - Additive Property (Bendat, 1998)
$$H[x_1 + x_2] = H[x_1] + H[x_2]$$
Equation 3 – Homogeneous Property (Bendat, 1998)
$$H[c x] = c H[x]$$
A non-linear system is therefore one where,
Equation 4 – Non Linear System; lack of Additive Property (Bendat, 1998)
$$H[x_1 + x_2] \neq H[x_1] + H[x_2]$$
Equation 5 – Non Linear System; lack of Homogeneous Property (Bendat, 1998)
$$H[c x] \neq c H[x]$$
This means that when the input to a linear system is random data with a Gaussian
(normal) probability density function, the output will also have a Gaussian
probability density function. Bendat (1998) also makes the point that any
physical system will display non-linear properties if the range of input conditions is
suitably wide. Just as this is true for numerous examples in flight dynamics, it is also
true for various instances in safety and reliability, where oversimplifying
assumptions are made regarding the condition of equipment and its interaction
with maintenance and operating organisations. Human behaviour often defies
mathematical modelling due to its complexity and non-linear properties. As
previously described, it is common for safety analyses and models to assume
linear behaviour. In fact complex socio-technical systems generally exhibit a
lack of additive and homogeneous properties; where different inputs combine to
produce unexpected and ‘out-of-control’ outputs resulting in accidents. This
explains some of the difficulties encountered in producing a workable approach
to human and organisational reliability, as outlined by Rasmussen (1997).
Non-linear effects explain the concept of emergence: the behaviour of linear
systems is predictable and tractable, yet non-linear systems produce
unexpected results. Grøtan et al (2011) outline how this leads to the concept of
‘Black Swan’ events, which are unexpected and have a huge impact, such as a
catastrophic accident with a complex system. These are understandable in
retrospect but could not have been predicted. Leveson (2011) describes how
such accidents are the result of non-linear interactions between components of
the system, whether human, organisational or technological. The key to
developing an improved method of managing safety and estimating risk will be
to understand and predict these non-linear interactions.
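The additive and homogeneous properties given in Equations 2 and 3 can be checked numerically. The following is a minimal sketch only: both example systems (a constant gain and a clipping, or saturating, response), the input values and the function names are hypothetical choices made for illustration and are not taken from Bendat (1998).

```python
def linear_system(x):
    """Hypothetical linear system: a constant gain of 3 applied to every sample."""
    return [3.0 * v for v in x]


def saturating_system(x):
    """Hypothetical system that passes small inputs unchanged but clips at +/-1."""
    return [max(-1.0, min(1.0, v)) for v in x]


def is_additive(H, x1, x2, tol=1e-9):
    """Check H[x1 + x2] == H[x1] + H[x2] sample by sample (Equation 2)."""
    combined = H([a + b for a, b in zip(x1, x2)])
    separate = [a + b for a, b in zip(H(x1), H(x2))]
    return all(abs(c - s) <= tol for c, s in zip(combined, separate))


def is_homogeneous(H, x, c, tol=1e-9):
    """Check H[c x] == c H[x] sample by sample (Equation 3)."""
    scaled_input = H([c * v for v in x])
    scaled_output = [c * v for v in H(x)]
    return all(abs(a - b) <= tol for a, b in zip(scaled_input, scaled_output))


# Illustrative inputs: 'small' stays inside the clipping limits, 'wide' does not.
small_1, small_2 = [0.1, -0.2, 0.3], [0.3, 0.1, -0.1]
wide_1, wide_2 = [0.8, -0.5, 0.2], [0.7, 0.1, -0.3]

print("linear, wide inputs, additive:", is_additive(linear_system, wide_1, wide_2))
print("saturating, small inputs, additive:", is_additive(saturating_system, small_1, small_2))
print("saturating, wide inputs, additive:", is_additive(saturating_system, wide_1, wide_2))
print("saturating, wide inputs, homogeneous:", is_homogeneous(saturating_system, wide_1, 2.0))
```

Under these assumptions the saturating system satisfies the additive check for the small inputs but fails both checks once the inputs are wide enough to reach the clipping limit, which is consistent with the point that a system may appear linear only over a restricted range of input conditions.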
2.4 Resilience Engineering
The theory of resilience engineering is emerging as a response to the problems
posed to safety management and engineering by complexity theory and the age
of the organisational accident as described by Reason (1997). The central
theme is to move from a focus on failure, where notions of component reliability
are applied to complex systems, humans and organisations; to looking at how
systems can succeed under varying conditions. The literature on the subject is
somewhat fragmented, although a series of books has been published, which
bring together the key ideas. One of the aviation organisations embracing
resilience engineering is EUROCONTROL, a multinational air traffic
management service provider; Leonhardt et al (2009) published a white
paper on the application of resilience engineering within the organisation. This
illustrates that there is a blurred line between ‘traditional resilience’ study as
applied to infrastructure, and resilience engineering which has emerged from
the study of safety. Hollnagel et al (2011) give a simple definition of resilience:
“Resilience is the intrinsic ability of a system to adjust its functioning prior to,
during, or following changes and disturbances, so that it can sustain required
operations under both expected and unexpected conditions.”
Woods and Hollnagel (2007) set the scene for resilience engineering. They
outline fundamentals which include a shift away from the traditional safety focus
on ‘what went wrong’ (hindsight) and ‘what could go wrong’ (risk assessment) to
a focus on ‘what can go right’ for risk assessment and ‘what did go right’ for
accident analysis – also neatly summarised by Schafer (2012). Resilience
engineering also rejects the notion of human failure, error taxonomies and
reliability analysis of complex systems in favour of a theory that failures
represent either the breakdown in strategies for coping with complexity, or an
unfavourable combination of functional variability within a system (technological,
human or organisational). In resilience engineering, safety is redefined as the
ability to succeed under varying conditions. By observing how systems work
under everyday pressures, it should be possible to understand the level of
resilience in a system and how it might be engineered to increase this quality.
For the purposes of both accident investigation and risk assessment it is
necessary to move away from linear combinations of events to an
understanding of how a system might lose its dynamic stability and veer into an
accident trajectory (Hollnagel et al., 2007). In summary, there are four key
precepts to Resilience Engineering:
1. Performance conditions are always underspecified. Individuals
and organisations must therefore adjust what they do to match current
demands and resources. Because resources and time are finite, such
adjustments will inevitably be approximate.
2. Some adverse events can be attributed to a breakdown or
malfunctioning of components and normal system functions, but others
cannot. The latter can best be understood as the result of unexpected
combinations of performance variability.
3. Safety management cannot be based exclusively on hindsight,
nor rely on error tabulation and the calculation of failure probabilities.
Safety management must be proactive as well as reactive.
4. Safety cannot be isolated from the core (business) process, or
vice versa. Safety is the prerequisite for productivity, and productivity is
the prerequisite for safety. Safety must therefore be achieved by
improvements rather than by constraints.
These precepts define a theoretical approach drawn from various ideas about
organisational accidents and safety culture. The key development is the focus
on the functions within the system and the emphasis on improving their
combined performance, rather than a focus on the potential sources of hazards
and barriers for accident prevention. This positive standpoint is a key attraction
to the approach; the drive for operational performance improvement and safety
can be in synergy rather than in conflict. Hollnagel (2011) gives four
cornerstones to the practice of resilience engineering. The first is knowing what
to do to respond to everyday disturbances – the actual. The second is knowing
how to monitor potential threats from the environment and from the functioning
of the system itself – the critical. The third is knowing what to
expect in terms of threats and opportunities – the potential.
Finally, the fourth ‘cornerstone’ is the ability to address the factual
through learning.
Figure 2-5 The Four Cornerstones of Resilience (Hollnagel, 2007)
[Figure: Responding (actual) – knowing what to do; Monitoring (critical) – knowing what to look for; Anticipating (potential) – knowing what to expect; Learning (factual) – knowing what has happened.]
A slightly different conceptual framework for Resilience Engineering is
presented by Madni (2009), offering more concrete requirements for
operationalising the practice:
Figure 2-6 Conceptual Framework for Resilience Engineering (Madni, 2009)
2.4.1 Resilience Engineering as a Successor to Safety Management
Leonhardt et al (2009) put the resilience engineering approach to safety
management simply:
The more likely it is that something goes right, the less likely it is that it goes
wrong.
Cambon (2006) provides a resilience framework for assessing safety
management systems, proposing a number of metrics based on Tripod
theory which essentially measure the performance conditions under which the
SMS operates. The balance of these performance conditions is said to
determine the stability of the SMS. ‘Engineering’ implies design and
Beauchamp (2006) notes how this can be achieved through organisational
learning to provide organisational resilience; a model for guidance is provided.
Zarboutis (2006) describes how, analogous to Rasmussen’s (1997) approach to
organisational drift, resilience engineering can identify symptoms of an erosion
in resilience. Johansson (2008) provides a ‘quick and dirty’ approach to
evaluating resilience in systems; this is a helpful overview but it does not prescribe
specific improvement or change activities. Stoker (2008) outlines a
comprehensive approach to the assessment of operational resilience,
effectively specifying a goal-based hierarchy for elements contributing to
resilience and producing a checklist approach. Whilst this is undoubtedly a
valuable activity, it is questionable whether it will be able to deal with the
emergence of safety issues.
2.4.2 Under Specification of Performance Conditions
Under-specification of performance conditions, that is, the factors that affect the
execution of a particular function, is a key concept in the literature (Hollnagel,
2007). In most organisations performance conditions are subject to control
through rules, with the idea that this will improve safety. Hale (2013) reviews the
literature on this, noting that there are two approaches: a classical top-down
approach that punishes transgression, and a bottom-up approach that
sees expert ability to adapt to changing circumstances as paramount.
Nathanael (2006) notes that it is impossible to make what happens in practice
match that which is espoused by officialdom; the key to generating resilience is
dialogue between the hierarchical levels.
2.4.3 Performance Variability
Resilience engineering regards performance variability as inherently useful; it
allows operations to continue in underspecified conditions. It also provides the
potential for coupling between functions where upstream performance variability
combines with downstream performance variability to grow in amplitude. This
phenomenon can be harnessed for system success or else it provides an origin
for safety risk (Hollnagel, 2012).
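The way in which upstream and downstream variability can combine may be illustrated with a very simple numerical sketch. This is an assumption made purely for illustration: the resilience engineering literature treats coupling qualitatively, and the linear recurrence, the `coupling_gain` parameter and the numbers below are hypothetical rather than drawn from Hollnagel (2012). Each function in a chain adds its own variability to a proportion of the variability it inherits from the function upstream.

```python
def propagate_variability(own_variability, coupling_gain):
    """Return the output variability after each function in a chain.

    own_variability: variability contributed by each function (arbitrary units).
    coupling_gain:   hypothetical factor by which inherited variability is
                     damped (< 1) or amplified (> 1) by the next function.
    """
    carried = 0.0
    history = []
    for own in own_variability:
        # Inherited variability is scaled by the coupling, then the
        # function's own variability is added to it.
        carried = coupling_gain * carried + own
        history.append(carried)
    return history


chain = [1.0, 1.0, 1.0, 1.0]  # four functions, equal variability (illustrative)
damped = propagate_variability(chain, coupling_gain=0.5)
amplified = propagate_variability(chain, coupling_gain=1.5)

print("damped chain:   ", damped)
print("amplified chain:", amplified)
```

In this sketch the same individual variability settles towards a bound when the coupling damps it, but grows along the chain when the coupling amplifies it. No single function behaves abnormally in either case; the difference lies only in how the functions are coupled.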
2.4.4 Examples of Resilience Engineering in Practice
Resilience engineering is more theoretical than its name suggests and
the practicality of implementing its precepts remains a matter of
debate. However, its principles can be found in evidence where it was not
specifically applied. Table 2-3 provides a brief summary of some examples.
Table 2-3 Examples of Resilience Engineering in Practice
| Industry | Tools | Insights |
| --- | --- | --- |
| Process Industry | Survey of workforce using Principal Component Analysis | Shirali et al. (2013) attempt quantitative measurement of resilience at an organisational level. Only possible to measure the potential for resilience rather than resilience itself. The following variables are given as indicators: top management commitment; just culture; learning culture; awareness and opacity; preparedness; flexibility. |
| Process Industry | Bayesian Networks; Resilience Dashboard (not currently achievable) | Pasman et al. (2013) define a holistic control methodology for plant safety using leading indicators derived from process measurements within the plant. Also use of process simulation tools to develop scenarios. Traditional HAZOP/FMEA analyses do not capture all potential accident scenarios. Key points: technical resilience can be measured/simulated, organisational factors less so; importance of leading indicators to enable response to variations; difficulty in dealing with drift in safety metrics; safety gains made through interdepartmental cooperation vs common cause failures; advocate extensive use of bow-ties. |
| Aviation | Interviews, audit and expert analysis | An investigation into both the sources of resilience and sources of brittleness. Comparison of two comparable small air carriers. Identification through extensive interviews. Resilience and brittleness categorised and risk assessed (Saurin and Carim Junior, 2012). |
| Air Traffic Management | FRAM | Analysis of a mid-air collision fatal accident. Provides notes on buffering capacity, flexibility, margins, tolerance and cross scale interactions. There was no root cause – aircraft and ATM was operating normally. The system was inadequate (de Carvalho, 2011). |
| Aviation | Bayesian Belief Networks (BBN) | Examines the use of and qualification of experts to provide probability estimates for BBN. Hidden common causes in BBN – principally safety culture. Difficulty in estimating frequencies or probabilities of rare events. BBN assume the ‘Causal Markov Condition’ therefore common cause failures are difficult to deal with – maybe applying BBN to FRAM would solve this issue (Brooker, 2011). |
| Aviation | FRAM | Alaska Airlines flight 261 accident analysed to understand FRAM’s performance against 5 key resilience characteristics: buffering capacity, flexibility, margin, tolerance, and cross-scale interactions (Woltjer, 2007). |
| Railways | FRAM | Interdisciplinary safety analysis of complex socio-technological systems based on the Functional Resonance Accident Model: an application to railway traffic supervision (Belmonte et al., 2011). |
| Nuclear | FRAM | Specific case study surrounding a task to move Nuclear Fuel – a specific task analysis rather than a generic system approach (Lundberg, 2008). |
2.4.5 Criticism of Resilience Engineering
Oxstrand and Sylvander (2010) argue that Resilience Engineering is little more
than a rebranding of safety culture; they do not see how the practice can be
applied to the nuclear industry, which already uses both PRA and human
reliability analyses in the licensing of nuclear plants. In this industry, it is argued,
safety culture forms part of every operation. The nuclear industry defines safety
culture as:
“Safety Culture is that assembly of characteristics and attitudes in organisations
and individuals which establishes that, as an overriding priority, nuclear plant
safety issues receive the attention warranted by their significance.”
International Atomic Energy Agency (Edwards et al., 2013)
Clearly safety culture is fundamental to engineering resilience into a
socio-technical system. The theory of safety culture does not in and of itself propose a
different conceptual framework for the origin of unsafe system performance.
Also some safety culture literature describes a requirement for safety to become
the overriding priority for an organisation (Edwards et al., 2013). Clearly this is
at odds with notions of efficiency-thoroughness trade-offs and the requirement
to increase the proportion of activities that ‘go right’ as a means for reducing the
number that ‘go wrong’. Whilst Resilience Engineering draws on much of the
theory around safety culture, it goes a lot further in proposing ways in which
organisations can be designed, analysed and modified in order to deliver
resilience. Le Coze (2013) describes a number of criticisms of Resilience
Engineering, the foremost amongst these being scepticism over the need to
introduce a new vocabulary to safety science. He also notes that the social
concept of power is missing from the resilience literature, although it could be
argued that the exercise of social power could be modelled as a function or a
resource. He also notes that many have argued that
resilience engineering does not present anything new; it simply
connects a number of existing ideas, foremost of which is the High Reliability
Organisation concept. He does note that the proof of the concept will be in its
application to real systems – testing the worth of the ‘engineering’ aspect of the
theory. McDonald (2008) asserts that Resilience Engineering is attractive
because other models are weak. He notes that the theory needs to be further
unified and demonstrated in practical examples.
2.4.6 Resilience Engineering and Airworthiness
Current MAA (2011a) policy is based on the idea that airworthiness is made up
of four pillars: the safety management system, compliance with recognised
standards, competence (of people and organisations) and independent
assessment. All of these activities and qualities are likely to contribute to the
resilience of an airworthiness system. Wilson (2008) provides a system model
for resilience of an airworthiness system and presents a number of key ideas:
The requirement for ‘organisational mindfulness’ – a safety culture keen
to seek out areas of risk.
Balancing ALARP principles with ‘And Still Stay In Business’, which could
be thought of as an efficiency-thoroughness trade-off, as per Hollnagel
(2011).
Understand how the organisational boundaries contribute to safety;
dealing with outsourcing, partnering and regulation.
Translate strategies into management frameworks for managing
organisational risk – these can be represented by ‘framework diagrams’
that show the factors that impact on safety management systems.
This work was succeeded by a thesis by Wilson (2012) which produced
RISK2VALUE, an integrated management
framework and decision support toolkit which addresses both safety and value
management at an organisational level. A generic diagram, shown at Figure 2-7,
is provided to support decisions; its use is illustrated by means of an
extensive diagram mapping various relationships. The strength of this approach
is that it either provides a generic approach to an audit of airworthiness or would
guide the construction of a new system. Equally it provides an assessment of
socio-technical factors surrounding accidents. A criticism that could be levelled
at the tool is that the linkages between the elements are not explicitly defined
and it is therefore unclear how changes would influence the path that the
organisation took through the diagram.
Figure 2-7 Framework for managing the impact organisation, technology and human factors have on safety management systems (Wilson, 2008)
2.4.7 Lean Resilience
Leonhardt et al (2009) note that modern business systems are largely premised on
‘just-in-time’ processes. This methodology increases efficiency and
consequently coupling between upstream and downstream functions. Individual
system boundaries are more difficult to define as, for example, maintenance
units become increasingly tightly dependent on supply chains. Carney (2010)
urged caution in the introduction of lean principles and envisaged a hybrid
between lean maintenance and a more traditional model. Resilience
engineering in other domains has shown that it is in fact possible to harness the
approach to introduce production improvement alongside safety (Hounsgaard,
2013). Lean methodology is profoundly linear in its thinking (Carney, 2010); this
methodology is easily deployable in a highly tractable system such as a
production line. In less tractable systems such as maintenance it is likely that
Resilience Engineering techniques will produce better results.
2.5 Functional Resonance Analysis Method
The resilience engineering literature lacks specific methodologies or tools for
practical implementation of resilience engineering principles. The notable
exception is Hollnagel’s (2012) Functional Resonance Analysis Method
(FRAM). This is a technique for building models of complex socio-technological
systems. It differs from STAMP, in that it is a method for generating a model
rather than a model. FRAM maps the system as a series of functions, defined
by their various ‘aspects’ and linked ‘activities’.
Figure 2-8 FRAM Function
[Figure: a single FRAM function shown with its six aspects: Input (I), Output (O), Preconditions (P), Resources (R), Time (T) and Control (C).]
By analysing the output variability from each function and the extent to which
this variability is damped up-stream, it is possible to begin to understand how to
analyse system performance from a resilience engineering point of view. The
FRAM forms the basis of the case study in later chapters and is described in
detail in Chapter 4.
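The six aspects of a FRAM function lend themselves to a simple data structure. The following is a minimal sketch only: the class name, field names, the `couplings` helper and the example maintenance functions are all hypothetical and are not part of Hollnagel's (2012) method or of any FRAM software. Functions are assumed to be linked wherever the output of one carries the same label as an aspect of another.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FramFunction:
    """One FRAM function described by its six aspects (labels are free text)."""
    name: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    preconditions: List[str] = field(default_factory=list)
    resources: List[str] = field(default_factory=list)
    time: List[str] = field(default_factory=list)
    control: List[str] = field(default_factory=list)

    def incoming_aspects(self) -> Dict[str, List[str]]:
        """The five aspects that can receive another function's output."""
        return {
            "Input": self.inputs,
            "Precondition": self.preconditions,
            "Resource": self.resources,
            "Time": self.time,
            "Control": self.control,
        }


def couplings(functions: List[FramFunction]) -> List[Tuple[str, str, str, str]]:
    """List (upstream, label, downstream, aspect) wherever labels match."""
    links = []
    for upstream in functions:
        for label in upstream.outputs:
            for downstream in functions:
                if downstream is upstream:
                    continue
                for aspect, labels in downstream.incoming_aspects().items():
                    if label in labels:
                        links.append((upstream.name, label, downstream.name, aspect))
    return links


# Hypothetical example functions, for illustration only.
plan = FramFunction("Plan maintenance", outputs=["Work order"])
issue = FramFunction("Issue procedure", outputs=["Approved procedure"])
maintain = FramFunction(
    "Carry out maintenance",
    inputs=["Work order"],
    control=["Approved procedure"],
    outputs=["Serviceable aircraft"],
)

links = couplings([plan, issue, maintain])
for link in links:
    print(link)
```

Representing couplings by matching labels keeps the model as a set of independently described functions, with the links derived from the descriptions rather than drawn in advance.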
2.6 Quantifying Resilience
Most approaches to quantifying resilience rely on surveys and audit approaches
such as those described by Shirali (2013) or by Saurin (2012). However whilst
an overall system assessment is of value, system managers are interested in
particular risks and being able to quantify them and manage them towards
ALARP levels, as required by legislation. Within process industries a high
degree of automation can be achieved, with intensive data collection and
monitoring. These aspects mean that it is comparatively easy to run simulations
and model different systems. Risks can therefore be assessed in a more
quantifiable manner (Pasman et al., 2013). A reliability approach to safety is easily
quantifiable through linear decomposition to produce probabilistic risk
assessment. By contrast it is much more difficult to provide quantitative
assessment using a resilience engineering approach. Luxhøj (2003) and
Williams (1996) present Bayesian Belief Networks (BBN) as a potential solution to low
probability – high consequence risks. Slater (2013) has presented an approach
to nesting BBN within a FRAM model, hence providing a way of quantifying
risk analysis developed through FRAM. He presents this technique as an
alternative to HAZOPS for use in the process and transport industries. Brooker
(2011) analyses BBN in the aviation domain, focusing specifically on the ability
of experts to provide accurate assessments of probability in the case of low
probability events. He notes the ‘Causal Markov Condition’, which is an
assumption in BBN that there is no common cause failure mode across the
network; issues such as ‘safety culture’ are therefore difficult to address. Other
potential techniques for quantification are the use of fuzzy logic or fuzzy set
theory with the use of Monte Carlo simulation (Shirali, 2013). An approach to
quantifying resilience in the context of civil infrastructure is presented by Vugrin
(2009), providing a menu of control engineering methodologies that may be
suitable. The issue of data collection in more human centric systems remains a
barrier to expansion of this method. Quantification is the key if Resilience
Engineering is going to gain ground against more traditional risk assessment
techniques.
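The difficulty that a hidden common cause poses for a network that assumes independence can be shown with a short worked calculation. This is a hedged illustration only: the two-barrier arrangement, the treatment of ‘safety culture’ as a binary good/poor state and every probability below are hypothetical values chosen for the arithmetic, not figures from Brooker (2011) or any other source.

```python
def marginal_failure(p_fail_good, p_fail_poor, p_poor):
    """Overall failure probability of one barrier (law of total probability)."""
    return (1.0 - p_poor) * p_fail_good + p_poor * p_fail_poor


def joint_failure_assuming_independence(p_a, p_b):
    """Joint failure if the two barriers are treated as independent."""
    return p_a * p_b


def joint_failure_with_common_cause(p_fail_good, p_fail_poor, p_poor):
    """Joint failure when both barriers share one hidden cause.

    The barriers are assumed independent only once the state of the
    common cause (good or poor culture) is known.
    """
    return (1.0 - p_poor) * p_fail_good ** 2 + p_poor * p_fail_poor ** 2


# Hypothetical values: each barrier fails 1% of the time under a good
# culture and 20% of the time under a poor one; poor culture 10% of the time.
P_FAIL_GOOD, P_FAIL_POOR, P_POOR = 0.01, 0.20, 0.10

p_single = marginal_failure(P_FAIL_GOOD, P_FAIL_POOR, P_POOR)
naive = joint_failure_assuming_independence(p_single, p_single)
actual = joint_failure_with_common_cause(P_FAIL_GOOD, P_FAIL_POOR, P_POOR)

print(f"single barrier failure probability: {p_single:.4f}")
print(f"joint failure, independence assumed: {naive:.6f}")
print(f"joint failure, common cause included: {actual:.6f}")
```

With these illustrative numbers the probability of both barriers failing is several times higher once the shared cause is accounted for than the independence assumption would suggest, even though the failure probability of each individual barrier is identical in both calculations.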
2.7 Concluding Remarks
The various ages of safety theory were all products of the technology of their
time. Now in an age characterised by networked technology it is clearly time to
fully address notions of complexity for the purpose of providing safe systems.
This is certainly the case for the new generation of civil and military aircraft.
Resilience Engineering appears to offer a different approach to previous
theories and models. In particular the notion that accidents emerge from
unforeseen combinations of varying functional performance is a powerful one. It
offers the prospect that analysis from this perspective might provide risk insights
that may otherwise be missed. It also rings true from experience within an
airworthiness environment. Notions of ‘accident trajectories’ and holes in
processes or defences do not resonate in the same way. There is an
opportunity to combine efforts in process improvement and efficiency with
safety strategies. Resilience engineering offers the theoretical framework and
FRAM provides a potential method. This will be explored in subsequent
sections. It remains the case however that there is some way to go to
operationalize Resilience Engineering; Madni (2009) lists the key issues:
Help organizational decision makers in making trade-offs between
severe production pressures, required safety levels and acceptable risk.
Measure organizational resilience.
Identify ways to engineer the resilience of organizations.
The following chapters outline a case study in which this approach is tested.
3 METHODOLOGY
3.1 Introduction
In order to meet the research aim it was necessary to choose a technique with
which to model an airworthiness management system. The literature review
revealed that the Functional Resonance Analysis Method (FRAM) was the best
way to practically apply resilience engineering principles. The FRAM therefore
formed the basis of the practical element of the research. A single case study
organisation was used, with an aspiration of delivering an operationally useful
tool to the organisation at the end of the project. The case study was conducted
in two stages:
Stage 1 – Construct a FRAM Model of the Airworthiness Management
System and concurrently develop a visualisation tool.
Stage 2 – Test the model using scenarios drawn from occurrence
reporting and potential in-service airworthiness risks.
The model was developed iteratively, using expert opinion and data from a
variety of sources.
3.2 Working Arrangements
A key difficulty reported by other FRAM practitioners has been understanding
‘work as done’ rather than ‘work as imagined’. This was mitigated by conducting
the research from within the case study organisation on a part-time basis, whilst
working within the Force Operations Centre. Moreover, this was preceded by nine
years’ work in other roles in military airworthiness, including quality assurance,
process improvement and error investigation roles. This provided insight into
‘work as done’ practice. Whilst there was a risk of bias, this was mitigated to
some extent through exposing parts of the model to other workers within the
organisation for verification.
3.3 Research Interviews
Semi-structured interviews were conducted with 19 different workers across all
of the functions. The interviews were flexibly arranged at the interviewees’ work
location (generally offices but control rooms and tool stores were also visited). A
pre-briefing was provided in the form of a two sided A4 document, shown at
Appendix C. The average interview duration was around 30 minutes, giving a
rough total of around nine and a half hours of interview time over the course of
the project. The general interview structure was as follows:
Check understanding and clarify scope of the study.
Confirm that participant was currently engaged in the function as part of
their daily activity.
Check accuracy of each of the function aspects.
Open questioning to highlight particular areas of variability in the
‘aspects’ of the function.
Open questions to ascertain whether any aspects had been missed.
Open questions to ascertain whether the participant’s work covered any
further relevant functions.
The following research interviews were conducted:
Deputy Continuing Airworthiness Manager
Engineering Authority – various team members.
Military Airworthiness Review Certificate team member.
It is not possible to attribute individual model elements to particular sources; the
iterative nature of FRAM development precludes this in an experimental project
of this nature. It is recognised that this is a weakness in the process; however,
this is mitigated by the volume of cross-checking required to ensure model
consistency. The visualisation tool provided the final check of model
consistency in that all aspects had to be connected to another function or to an
external resource – loose ends were not allowed.
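The ‘no loose ends’ rule can be expressed as a simple check. The sketch below is illustrative only: the data layout (a dictionary of functions, each with named aspects and the source each aspect is coupled to) and the names `find_loose_ends` and `EXTERNAL` are assumptions made for this example, not the structure of the spreadsheet model or the visualisation tool.

```python
# Illustrative sketch only: hypothetical data layout, not the TASM spreadsheet.
EXTERNAL = "EXTERNAL"  # assumed marker for an external resource or environment


def find_loose_ends(model):
    """Return (function, aspect) pairs that are not connected to anything.

    `model` maps a function name to a dict of aspect name -> source, where
    the source is another function name, EXTERNAL, or None if unconnected.
    """
    loose = []
    for function, aspects in model.items():
        for aspect, source in aspects.items():
            connected = source == EXTERNAL or source in model
            if not connected:
                loose.append((function, aspect))
    return sorted(loose)


example_model = {
    "Ground Handling": {"Resource": EXTERNAL, "Control": "Supervise Maintenance"},
    "Supervise Maintenance": {"Input": None},  # deliberately left unconnected
}
print(find_loose_ends(example_model))
```

A model passes the check when the returned list is empty.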
3.5 Air Safety Information Management System Data
Data from the Air Safety Information Management System (ASIMS) was used to
provide information on the variability of output from various functions within the
model.
3.5.1 Data Extraction
Data was extracted from ASIMS using the ‘Search Reports’ facility (MAA,
2011a), which allows the user to apply various filters to the database. This allowed
only Tornado DASORs to be considered, within an initial selected date range of
1 Jan 2006 to 1 Nov 2013. This range allowed consideration of a time period in
which organisational structures have been relatively stable e.g. since the end-
to-end logistics transformation process (2001-2006), where a large number of
previously in-house functions were outsourced to industry. This date range was
then reduced to the most recent 12 month period 1 November 2012 to 1
November 2013, once the work required to analyse each entry became
apparent. ASIMS uses a standard taxonomy to describe both Occurrence
Cause Groups (OCG) and event descriptors. The MAA describes OCG as the
“final link in the chain which caused the occurrence… the one and only final
cause” and event descriptors are other ‘events in the chain’ (MAA, 2011a). This
clearly represents a linear accident model rather than the complex model
represented by FRAM. That said, both OCG and event descriptors do provide a
useful indication as to instances where undesirable functional variability
occurred. A second issue was that the FRAM Model was limited to
airworthiness management rather than ‘flight safety’ or ‘air safety’ in totality.
Operator actions that affected the airworthiness of the aircraft were included in
the ASIMS download because such incidents rely on the performance
of additional functions to maintain continuing airworthiness following harmful
variability in the ‘operate aircraft’ function. For example, an operating pilot might
inadvertently cause a flap over-speed; this then relies on fault reporting, and
corrective maintenance functions (amongst many others) to perform within
acceptable limits in order to restore airworthiness. Table 3-1 shows the cause
and event descriptors that were included in the ASIMS ‘search report’ filter and
consequently the downloaded data. Reports that were captured in the ASIMS
filters but that were found to have no airworthiness aspect were deleted. This
left a total of 426 reports for analysis.
Table 3-1 D-ASOR Classifications included in Data
Cause and Event Descriptor | Sub-Categories Included in Download
Hostile Action | Nil
Human Factors (ATC/ABM) | Nil
Human Factors (Aircraft Operation) | Flap / Slat / Airbrake Overspeed, Fuel Management, Gear Overspeed, Inadvertent Operation, Incorrect In-flight Shutdown, Incorrect Switch / Control Selection / Position, Overcontrol, Overstress, Overtemp, Overtorque, Undercontrol, Access Not Closed, Equipment Not Secured, Incorrect Use of Emergency Equipment, Loose Article, Collision with Aircraft/Vehicle, Collision with Ground Object, Deep Landing, Downwash, Flap / Slat Overspeed, Gear Overspeed, Heavy Landing, Tail Strike, Blanks / Pins Not Removed, Missed on Walk Round, Wrong aircraft, Blanks / Pins Not Fitted, Chock Jump, Collision with Aircraft/Vehicle.
Human Factors (Maintenance) | All
Human Factors (Ground Services) | All
Human Factors (Other) | Material Dropped into Open System, Material left in Aircraft or Engine, Access not Closed, Equipment not Secured, Incorrect Use of Emergency Equipment.
Not Positively Determined | All
Organisational Fault | All
Technical Fault | All
Unsatisfactory Equipment | All
3.5.2 Assignment of Related Functions to Incidents
Once ASIMS data had been exported into an MS Excel format, each report was
assigned to up to 3 functions to indicate that the occurrence was a result of
output variability from each of these functions. The three functions were
assigned in a rough order of proximity to the reported occurrence. For example,
an instance of nose wheel steering failure would show the mechanical
system function as the first and ‘closest’ function to the occurrence because the
variation in this function’s output was what was being reported. However, it may
have been variability in the output of the electrical system function that caused
downstream variability in the mechanical system; in this case the electrical
system function would be recorded second. Only three functions were assigned
to enable expedient data processing; the output was designed to be a rough
indicator of reported functional variability so this simplification was deemed
acceptable. Assignment was a matter of judgement formed by reading the ‘Brief
Title’, ‘Description’, ‘Investigation and Rectification work’, ‘Other Equipment
Involved’, ‘Cause Narrative’ and ‘Cause Observations’ fields. As the taxonomy
and language used by report authors did not correspond directly to the FRAM
model, this had to be carried out manually, which precluded a full analysis of
each report due to the time required. In addition to assigning three FRAM
functions to each report, a further set of fields was added to the data to show
the number of couplings by type of function e.g. Human-Human, Technological-
Organisational, Human-Organisational, etc. It was recognised that these
couplings might not be ‘direct function to function’ couplings and that there may
be intermediary functions identified in the FRAM model. Results from this
process are presented as they are used in chapter four.
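The tallying of couplings by function type can be sketched as follows. This is a minimal illustration under stated assumptions: the lookup table covers only a handful of functions, the example report is invented, and treating adjacent functions in the proximity order as coupled is an assumption of the sketch rather than a rule taken from the spreadsheet. The names `FUNCTION_TYPE` and `coupling_types` are hypothetical.

```python
from collections import Counter

# Assumed type lookup for a few functions; the full model has many more.
FUNCTION_TYPE = {
    "Mechanical Systems": "Technological",
    "Armament & Electrical Systems": "Technological",
    "Refuel/Defuel": "Human",
    "Supply Chain": "Organisational",
}


def coupling_types(assigned_functions):
    """Count couplings by function type for one occurrence report.

    `assigned_functions` lists up to three functions in rough order of
    proximity to the occurrence. Adjacent functions in that order are
    treated as coupled, which is an assumption of this sketch.
    """
    if len(assigned_functions) > 3:
        raise ValueError("at most three functions are assigned per report")
    counts = Counter()
    for nearer, further in zip(assigned_functions, assigned_functions[1:]):
        # Sort the pair so the label does not depend on coupling direction.
        pair = sorted([FUNCTION_TYPE[nearer], FUNCTION_TYPE[further]])
        counts["-".join(pair)] += 1
    return counts


# Invented example: nose wheel steering failure traced to the electrical system.
report = ["Mechanical Systems", "Armament & Electrical Systems"]
print(coupling_types(report))
```

Summing these counters over all reports would give totals per coupling type; the labels are order-independent, so ‘Technological-Organisational’ and ‘Organisational-Technological’ fall under one heading.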
4 BUILDING THE TORNADO AIRWORTHINESS SYSTEM
MODEL USING THE FUNCTIONAL RESONANCE
ANALYSIS METHOD
Chapter two described the theoretical background to resilience engineering and
identified the Functional Resonance Analysis Method (FRAM) as the most
practical way to apply the principles. As described in Chapter one, the RAF’s
Tornado GR4 fast jet aircraft fleet was used as a case study. The following
chapter describes how the Tornado Airworthiness System Model (TASM) was
constructed using the FRAM. The TASM was created within a Microsoft Excel
spreadsheet; Chapter 5 describes the accompanying Visualisation Tool, which
was developed concurrently with the spreadsheet model. A copy of the final
spreadsheet model is at Appendix A.
4.1 Basic Principles
A full description of the FRAM is given by Hollnagel (2012) and also on the
website www.functionalresonance.com (Hollnagel, 2014). Drawing on the
theoretical basis of resilience engineering already described, the basic
principles of FRAM are given as:
The Equivalence of Success and Failure. Things go right and wrong in
fundamentally the same way. Although outcomes may be different, the
underlying processes are not necessarily different.
Approximate Adjustments. The conditions under which work or activity is
conducted never entirely match those that are prescribed. Systems
normally adjust performance approximately to match existing conditions.
This approximation results in performance variability.
Emergence. Variability is not normally enough to cause an accident.
Variability may combine in unexpected ways leading to disproportionately
large, non-linear outcomes.
Functional Resonance. Occasionally functions reinforce each other
and cause unusually high output variability. This coupling effect is called
functional resonance, which may spread through the system. The
Hollnagel (2012) identifies three classifications of function; Technological,
Human and Organisational. The difficulty of classifying each function varies.
Some, such as the function carried out by an aircraft system e.g. ‘Defensive
Aids’ were clearly technological. The attribution of either ‘human’ or
‘organisational’ characteristics to functions was largely down to the number of
people involved. Broadly defined functions such as ‘Supply Chain’ were clearly
organisational in nature. Others such as ‘Refuel/Defuel’ are carried out by only
one or two people and hence were classified as ‘human’. In other cases, such
as ‘Scheduled Maintenance’ this was less clear, as the function represented the
conglomeration of a number of human functions but also required organisation
with a hierarchical structure. As this sub-step only provided an initial pointer
towards identifying output variability, these distinctions were not critical. As
described in chapter three, ASIMS data was used to show reported functional
variability. Figure 4-4 shows the number of times that functional variability was
reported. The majority of reports related to variability in technological functions,
which was because technological functions are those whose output has the most
direct impact on flight safety. In general, ASIMS reports did not identify
organisational or human factors related causes for occurrences. This may be
because the incidents were purely related to the reliability of the technology, or
because investigations did not probe deeply enough into the incidents to uncover
such causes. Also, the majority of occurrences did not result in any harm;
near misses due to human or organisational factors may not be
reported at the same rate as reliability issues.
Figure 4-4 Instances of Functional Output Variability Recorded in Occurrence
Reports 2012/13
Figure 4-5 Instances of Reported Functional Output Variability by Function Type
Figure 4-6 Total Instances of Functional Output Variability Recorded in
Occurrence Reports 2012/13
4.8 Step 2b – Identify Internal Sources of Output Variability
Using system experience, interview data and ASIMS data described above, the
sources of internal variability for each function were noted and then
characterised. Internal sources of output variability are those which are
produced from within the function due to its inherent nature. For example, technological
functions may suffer component failure due to wear-out, while human functions are
subject to a variety of psychological and physiological variations.
Table 4-3 Summary of Internal Variability (Hollnagel, 2012)
Type of Function | Possible internal sources of performance variability | Likelihood of performance variability
Technological | Few, well known | Low
Human | Very many | High frequency, large amplitude
Organisational | Many, function specific or relating to ‘culture’ | Low frequency, large amplitude
At this point in the analysis, notes were also made relating to any internal
damping mechanisms, for later reference. Damping mechanisms might include
internal redundancy in the case of technological functions, for instance a fail-
safe structure might continue to react loads to the full specification despite the
failure of one load pathway. In the case of an organisation, overlapping
responsibilities might provide cross checking of activity and reduce output
variability.
4.9 Step 2c – Identify External Sources of Output Variability
External output variability can be traced to some external dependency or linked
function in a process. The function ‘Ground Handling’ requires a variety of
resources in order for it to work (mechanics, drivers, tow tractor, etc.) and if
these aspects of the function vary in some respect then the potential exists for
the output of the ground handling function to also vary. For example, if the
ground handling team contained a particularly inexperienced worker then the
output of the function may potentially vary. Of course, damping factors whether
internal or external might remove this potential function output variability.
Damping factors could include additional supervision or time to complete the
task. As well as external variability within the defined function’s aspects (input,
precondition, resources, control and time) there are system-wide external
factors to consider that might exert influence on some or all functions, leading to
output variability. Such factors cannot be easily mapped in the FRAM Model;
they include environmental factors such as weather, infrastructure such as
heating, lighting, office space and IT reliability and also more intangible factors
such as cultural dimensions (such as ‘Just’, ‘Safety’ or ‘Reporting’ cultures).
Where external system-wide factors were potentially significant these were
noted at this step. The same data sources used for internal variability were also
used to produce notes on the external sources of variability for each function.
Table 4-4 Summary of External Variability (Hollnagel, 2012)
Type of Function | Possible external sources of performance variability | Likelihood of performance variability
Technological | Maintenance, misuse | Low
Human | Very many, social and organisational | High frequency, large amplitude
Organisational | Many, instrumental or ‘culture’ | Low frequency, large amplitude
Initial notes on internal and external sources of output variability were entered in
the FRAM Model as shown in Table 4-5:
Table 4-5 Example TASM Recording of Step 2a-c for Function 67 - Engine Fleet
Monitoring
Type of Function | Internal Variability | External Variability
Organisational. This contains a variety of technological and human judgement functions which combine to provide an overall organisational function. | Internal variability is caused by human judgement elements of the function. | There is a variety of commercial and operational production pressures that influence this function. The ability and expertise of front line squadrons also provides context to the advice given out from Propulsion Support Team/Rolls-Royce.
4.10 Step 2d – Most Likely Dimension of Output Variability
Steps 2b and c identified the sources of output variability; the next step
characterised the potential output variability in its most likely dimensions. The
principles of conservation of energy and mass dictate that output must be in
some form of mass or energy transfer. For many functions this also provides for
some form of information transfer in various media (verbal, electronic, visual,
etc.). In order to keep the model at a manageable size, not all functional outputs
are described in exhaustive detail. The level of detail is in itself an ‘efficiency
thoroughness trade-off’; the validity of the judgement will be iteratively assessed
and adjusted as the model is used. All outputs were linked to aspects of other
functions, apart from the aircraft functions themselves which interact with the
external environment. The self-contained nature of the system provided a
mechanism for checking the internal consistency of the model – all outputs must
link to another function or to the external environment. Hollnagel (2012)
provides two options for characterising output variability; either a ‘simple’
solution or an ‘elaborate’ solution. The simple solution provides characterisation
in terms of time or precision. Given the broad scope of this model and the
potential wide range of activity covered by a single functional output line, the
elaborate solution was used to characterise output variability. Hollnagel (2012)
identifies 8 manifestations of output variability which are further divided into four
subgroups.
Table 4-6 Elaborate Description of Output Variability (Hollnagel, 2012)
Manifestation of Variability | Description
Timing/Duration | Too early/ too late/ omission.
Force/ Distance/ Direction | Too weak/ too strong/ too short/ not far enough/ wrong direction/ too long/ too far/ wrong type of movement.
Wrong Object | Wrong object or points to wrong object.
Sequence (of actions or information) | Omission, jumping, repetition, reversal, wrong part.
Hollnagel (2012) emphasises the difference between actual variability and
potential variability. The main purpose of this model is to allow risk assessment;
potential variability is therefore the important issue and the subject of the initial
assessment that forms the basis of the model. Hollnagel describes potential
variability as what ‘could possibly go right or wrong’. Given the broad scope of
this model, this has been further clarified to the most likely potential variability.
This means that there is a steady state starting point from which the model can
be iteratively manipulated. The FRAM spreadsheet uses ‘drop-down’ selections
to allow allocation of ‘most likely’ output variability. The term ‘most likely’ allows
for the fact that some outputs may potentially be able to produce a variety of
manifestations. In particular instantiations of the model, these may not
correspond to the exact activity that is occurring. It is important to emphasise
that the model classifies the most likely output variability not the most likely
output, therefore there may be a more likely form of output but it is the rarer but
more variable form that is captured in the model. For example Table 4-7 shows
the characterisation of the output variability for flight servicing. One output of
this function is ‘replenishment of aircraft systems’ (with oils, greases and
gases). The description of the most likely output variability indicates that a
replenishment may be omitted or the wrong fluid used. Of course, in the majority of cases (or
instantiations) of this function in operation, the correct fluid will be used in the
correct quantity, hence exhibiting no variability.
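The ‘drop-down’ allocation of a most likely manifestation amounts to choosing one value from a closed set. A minimal sketch of that idea is given below, using the four groups from Table 4-6. The class and field names are hypothetical, and the manifestation chosen for the flight servicing example is illustrative rather than copied from the spreadsheet.

```python
from dataclasses import dataclass
from enum import Enum


class Manifestation(Enum):
    """The four groups of output variability listed in Table 4-6."""
    TIMING_DURATION = "Timing/Duration"
    FORCE_DISTANCE_DIRECTION = "Force/Distance/Direction"
    WRONG_OBJECT = "Wrong Object"
    SEQUENCE = "Sequence"


@dataclass(frozen=True)
class OutputVariability:
    """Hypothetical record of the most likely variability of one output."""
    function: str
    output: str
    manifestation: Manifestation
    description: str


# Values here are chosen for illustration; they are not copied from the TASM.
example = OutputVariability(
    function="Flight Servicing",
    output="Replenishment of aircraft systems",
    manifestation=Manifestation.WRONG_OBJECT,
    description="Wrong fluid used",
)
print(example.manifestation.value)
```

Restricting the field to an enumeration mirrors the drop-down: a free-text entry outside the four groups is rejected when the record is built from a string via `Manifestation(...)`.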
These couplings only exist for finite periods of time and represent activities.
A process can be shown by an instantiation* of the model; showing a series of coupled functions forming a process.
The complexity of the system makes the model intractable if all processes are considered together. Using the interactive layers, various processes can be visualised and cross referred to the spreadsheet model.
The FRAM Spreadsheet Model contains information regarding likely variability of functional outputs; if inadequately controlled this variability may lead to hazards and accidents.
Function with potential to produce direct air safety hazards through their output
Function which directly affects condition of aircraft & equipment
Instructions for Highlighting a Process: Click the ‘Layers’ button above. Select a tick against each function identified as a part of the process. Select a new colour for each function that has been ticked; this must be the same colour for each function.
If you wish to also highlight the external processes involved, both the ‘0 - External’ and ‘0 – External Resources’ layers must be selected and given the same colour as the functions.
Tracking the Process Further into the System: Simply keep selecting and colouring functions. You can highlight all potential activities by selecting the layer ‘0 – BLUE’. The background can be selected or deselected using the ‘0 - Background’ layer.
Printing an Instantiation: The Internet Explorer print function produces a poor quality image. Instead, press Ctrl + Prt Scr on your keyboard, paste into a Word document, then use the crop tool.
Purpose of the tool: To allow visualisation of specific instantiations* of the Tornado GR4 airworthiness system to enable air safety/airworthiness occurrence investigation, airworthiness risk assessment and system improvement activity.
Key: each function is shown with its Function Name and number (00) and its six aspects: Input (I), Output (O), Preconditions (P), Resources (R), Control (C) and Time (T).
Figure 5-10 Visualisation Tool Key
This tool allows processes to be investigated for the purpose of risk assessment
or incident investigation. When used with the TASM spreadsheet, experienced
engineers or safety managers will be able to assist in engineering resilience into
the Tornado airworthiness system by adjusting controls on existing processes
so as to prevent harmful functional resonance occurring. Such system
adjustments will need to be based on assessment of the risks posed by
particular hazards, which may only become apparent through investigation of
incidents using the tool. Examples of incident investigation and risk assessment
are given in chapters six and seven. System adjustments themselves may take
any form that alters the way in which particular functions perform. For example,
if a reliability problem arose with a particular technical subsystem, resources may
be increased, for instance through the provision of additional funding to procure more spares
to feed scheduled maintenance. Control of the maintenance function would
need to change by altering the output of the ‘provide approved data’
function. Whilst all of these things may have been done without the use of the
tools, it is hoped that FRAM will provide insights into ‘whole system’ operation
and emergent behaviour that would otherwise be difficult to achieve.
6 USING THE TORNADO AIRWORTHINESS SYSTEM
MODEL FOR INCIDENT ANALYSIS
Chapter four described Step zero in the FRAM used to build the Tornado
Airworthiness System Model (TASM); this specified that the main purpose of the
TASM was for risk assessment. However, using the visualisation tool allows
ready decomposition of the model into parts pertinent to particular incidents.
Particular processes can be highlighted, with other functions being left as
background functions on the assumption that their variability was not significant
in controlling the processes involved in the incident.
6.1 Case for Using FRAM for Incident Modelling
Chapter one discusses commonly applied accident models, whether they are
technological, human or organisational. The military air safety management
system uses an Occurrence Investigation process manned by local personnel to
understand any occurrences that had the potential to pose an unacceptable air
safety risk. For accidents or serious occurrences the MAA will convene Service
Inquiries to investigate using experts from the military air accident investigation
branch. Similar arrangements exist within civilian operators and regulators. The
purpose of applying FRAM to incident analysis is to provide a resilience
engineering perspective to understanding how incidents occurred and to
provide recommendations that are more likely than traditional methods to
prevent reoccurrence of similar or unrelated incidents. By understanding how
functional performance variability combined to produce an adverse outcome it
should be possible to understand how performance conditions might be shaped
or controlled to produce more desirable outcomes in the future. In order to
explore this hypothesis, two particular incidents that have occurred within the
RAF Tornado Force were selected and analysed. As the following analyses rely
only on data from existing occurrence reports, no new findings will be
highlighted – this chapter just demonstrates how incidents can be described
using the TASM.
6.2 Incident One – Thrust Reverser Incidents
Tornado employs a thrust reverse system to provide braking on landing in order
to slow the aircraft to safe taxying speeds. In the event thrust reversers fail to
operate, wheel braking may be used although this does increase the likelihood
of fire hazards from hot brakes, both to the aircraft and to ground crews.
Significantly, thrust reverse is also required in the event of a high speed
abort during take-off. Thrust reversers deploy as ‘clam-shell’ buckets directly
into the jet efflux, rear of the final nozzle in the RB199 engine exhaust system.
Figure 6-1 Tornado GR4 with Thrust Reversers Deployed (Cooke, 2004)
6.2.1 Description of Incidents
Tornado has experienced a recent history of thrust reverser incidents, some of
which are summarised here, using data taken from ASIMS:
Table 6-1 Thrust Reverser Air Safety Occurrence Reports 2012/13
The output of the propulsion system and the electrical system on which it relies
varied outside of the required performance envelope in that the thrust reverse
did not deploy because the upstream electrical function did not provide the
required output. In this case the electrical system output was an extreme case
of output variability – there was no power supplied to the thrust reverse circuit
when demanded. There were potentially other dimensions in which the
electrical output might have varied e.g. power, current, voltage etc.
Figure 6-3 Propulsion & Electrical System
Clearly this situation arose because the upstream maintenance functional
output meant that the electrical system was in the wrong configuration (CB
pulled). Thrust reversers are used on nearly every Tornado sortie – what then
was the key element of variability that made these instances different to most
other times the aircraft was operated? In each case, human maintenance
activity was required on a system that had an upstream connection with the
electrical system prior to the occurrence. Every Tornado flight requires a
significant degree of variable human functions to allow it to take place. Using
FRAM and the TASM, an occurrence investigator needs to establish:
How the functions came to combine in a manner that was potentially
hazardous to the system?
Given functional output variability is normally sufficiently damped so as
not to produce a hazardous output (e.g. thrust reverse normally operates
correctly), what damping function that is normally present was not
adequate in this case?
The TASM visualisation tool can be used to trace back through the system to
identify where functional resonance has occurred. Figure 6-4 highlights
potentially functionally resonant activities, which can then be examined in the
FRAM Model – shown with red outlines in Table 6-4:
Figure 6-4 Electrical System Potential Functionally Resonant Activities
[Figure: TASM visualisation of the full model. Each function is drawn as a hexagon labelled with its six aspects (I, O, P, R, C, T), its name and its model number; the coupling ‘No Electrical Supply to Thrust Reversers’ is labelled. Functions shown, by model number: 1: Flight Servicing; 2: Scheduled Maintenance; 3: Ground Handling; 4: 3 Month Flying Programme; 5: Task Maintenance; 6: Record Work done on Aircraft; 7: Train Maintenance Personnel; 9: Occurrence Reporting; 10: Supply Chain; 11: Fit/Remove Role & Arm Equipment; 12: Maintain GSE; 13: Tools & Test Equipment; 14: Fuel/Defuel; 16: Co-ordinate Maintenance Documentation; 17: Defer Faults; 19: Engine Health Monitoring; 20: Ground Services; 21: Force & A4 Operations; 22: Maintenance Programme Development; 23: Modify Aircraft; 24: Apply SI(T)s; 25: Report Faults & Husbandry; 26: Replacement of service life limited parts; 27: Airworthiness Review Certification; 28: Repair Spares – Industry; 29: Publish Aircrew Publications; 30: Publish Approved Data; 32: Cost/Benefit and Hazard Analysis; 33: Acquire Spare Parts; 34: Repair/Maintain Spares R2; 35: Monitor Reliability Data; 36: Release to Service; 37: Independent Technical Advice; 38: Store & Maintain Weapons & RE; 39: Demand & Return Spare Parts; 40: Repair Aircraft; 41: Structural Inspections; 42: Fault Diagnosis; 43: Corrective Maintenance; 44: Technical Assistance Process; 45: Avionic Flight Systems; 46: Defensive Aids; 47: Avionic Communications; 48: Armament & Electrical Systems; 49: Mechanical Systems; 50: Aircraft Structure; 51: Propulsion; 52: Crew Escape System; 53: Weapons; 54: Operate Aircraft; 55: Pre-Flight Checks; 56: Survival Equipment; 57: Handover; 58: Supervise Maintenance; 59: Independent Inspection; 60: Plan Weekly/Daily Flying Programme; 61: Rectification and Line Control Boards; 62: Manage Maintenance Extensions; 63: Configuration Management (LITS); 64: Operate Shift Pattern; 65: Software; 67: Engine Fleet Monitoring; 68: Design Organisations; 69: Chief Air Engineer. Also shown: Maintenance Personnel, Locally Manufacture Parts, Publish SI(T)s and Engine Performance Monitoring (numbered 8, 18, 31 and 66; pairing of numbers to names not recoverable).]
Table 6-4 FRAM Model of Electrical System
Name of Function: Armament & Electrical Systems
Aspect | Description of Aspect | Upstream Function Number | Upstream Function Name | Upstream Function Aspect | Most Likely Dimension of Upstream Output Variability | Description of Most Likely Upstream Output Variability | Frequency of Upstream Output Performance Variability | Amplitude of Upstream Output Performance Variability | Possible effect on this (downstream) Function | Possible effect on this (downstream) Function Output Variability (Damping) | Rough Downstream Function Variability Score
Input | Ground Services (Electrical Power Generation) | 20 | Ground Services (Cooling, Power, Dehumidification, Steps, Staging, Bungs, Blanks) | AC connected/removed to ground services - Arm Elect, Structure, Mech Sys, | Sequence | Omissions - items left attached or not fitted | Medium | Medium | No electrical output during maintenance | INCREASE | 12
Input | Propulsion System (Electrical Power Generation) | 51 | Propulsion | Electrical Power | Force/ Distance/Direction | Fail to provide power | Low | High | No electrical power | INCREASE | 9
Output | Electrical Power/Signals | | | | | | | | | |
Precondition | Apply Special Instructions (Technical) | 24 | Apply Special Instruction (Technical) | Special Instruction (Technical) Applied to applicable | Timing/Duration | Instruction not complied within specified time | High | High | Unsafe condition develops | INCREASE | 27
 | | | | System Loads are | Force/ Distance/Direction | Fails to react load | Low | High | Potential for electrical shorting or sparking | INCREASE | 9
Control | Operate Aircraft | 54 | Operate Aircraft | Inputs to aircraft systems | Force/ Distance/Direction | Incorrect control input | High | High | Potential for unsafe condition | INCREASE | 27
Time | Not initially described | | | | | | | | | NO CHANGE | 0
 | Not initially described | | | | | | | | | |
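How the ‘Rough Downstream Function Variability Score’ is calculated is not stated at this point. The values in the table are, however, consistent with multiplying ordinal ratings for frequency and amplitude (Low = 1, Medium = 2, High = 3) by a factor of 3 where the damping effect is INCREASE. The sketch below encodes that reading and checks it against the scored rows whose upstream function is named. It is an inference from the tabulated values, not a scoring rule taken from the source, and the names `RATING`, `INCREASE_FACTOR` and `rough_variability_score` are hypothetical.

```python
# Inferred from the table values only; the ratings and factor are assumptions.
RATING = {"Low": 1, "Medium": 2, "High": 3}
INCREASE_FACTOR = 3


def rough_variability_score(frequency, amplitude):
    """Score a row whose downstream effect is INCREASE."""
    return RATING[frequency] * RATING[amplitude] * INCREASE_FACTOR


# (upstream function, frequency, amplitude, score as tabulated)
rows = [
    ("Ground Services", "Medium", "Medium", 12),
    ("Propulsion", "Low", "High", 9),
    ("Apply Special Instruction (Technical)", "High", "High", 27),
    ("Operate Aircraft", "High", "High", 27),
]
for name, frequency, amplitude, tabulated in rows:
    print(name, rough_variability_score(frequency, amplitude), tabulated)
```

The ‘Time’ row carries no frequency or amplitude rating, so the sketch does not attempt to reproduce its score of 0.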
It is important to note that the visualisation tool automatically highlights all
activities which are linked to the electrical system function and any other
function identified in Table 6-3; this does not necessarily mean that these
activities were functionally resonant. To understand the relationship further it is
necessary to compare the model data to the occurrence report described
above. This shows that neither the operator function (pilot) nor the propulsion
system (providing power) output variability was significant during the
occurrence. This left the Apply Special Instructions (Technical), Corrective
Maintenance and Pre-Flight Checks aspects of the Electrical system function.
These three upstream functions are linked by various activities to the
‘precondition’ aspect of the Electrical System function. In this occurrence, all
three of these functions should have resulted in the CBs being correctly set.
The variation in their functional output meant that the CBs were incorrectly set
and the preconditions (otherwise termed ‘execution conditions’) for the Electrical
System were not present and therefore the electrical signal was not sent to the
thrust reverse element of the propulsion system. Table 6-5 shows these three
preconditions highlighted within the FRAM Model.
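Table 6-5 Electrical System Precondition Variability
Name of Function: Armament & Electrical Systems
Aspect | Description of Aspect | Upstream Function Number | Upstream Function Name | Upstream Function Aspect | Most Likely Dimension of Upstream Output Variability | Description of Most Likely Upstream Output Variability | Frequency of Upstream Output Performance Variability | Amplitude of Upstream Output Performance Variability | Possible effect on this (downstream) Function | Possible effect on this (downstream) Function Output Variability (Damping) | Rough Downstream Function Variability Score
Input | Ground Services (Electrical Power Generation) | 20 | Ground Services (Cooling, Power, Dehumidification, Steps, Staging, Bungs, Blanks) | AC connected/removed to ground services - Arm Elect, Structure, Mech Sys, | Sequence | Omissions - items left attached or not fitted | Medium | Medium | No electrical output during maintenance | INCREASE | 12
Input | Propulsion System (Electrical Power Generation) | 51 | Propulsion | Electrical Power | Force/Distance/Direction | Fail to provide power | Low | High | No electrical power | INCREASE | 9
Output | Electrical Power/Signals
Precondition | Apply Special Instructions (Technical) | 24 | Apply Special Instruction (Technical) | Special Instruction (Technical) Applied to applicable | Timing/Duration | Instruction not complied within specified time | High | High | Unsafe condition develops | INCREASE | 27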
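The rough variability scores tabulated above are consistent with a simple multiplicative scheme. As an illustrative sketch only, assuming that Low, Medium and High map to 1, 2 and 3, that an INCREASE effect carries a weight of 3 and that NO CHANGE carries a weight of 0 (these weights are inferred from the tabulated values and are not stated in the source), the scores can be reproduced as follows:

```python
# Assumed ordinal weights; inferred from the table values, not stated in the source.
LEVEL = {"Low": 1, "Medium": 2, "High": 3}
EFFECT = {"INCREASE": 3, "NO CHANGE": 0}


def rough_variability_score(frequency, amplitude, effect):
    """Return frequency x amplitude x effect weight for one upstream activity."""
    return LEVEL[frequency] * LEVEL[amplitude] * EFFECT[effect]


# Rows taken from the table: (upstream function, frequency, amplitude, effect)
rows = [
    ("Ground Services", "Medium", "Medium", "INCREASE"),
    ("Propulsion System", "Low", "High", "INCREASE"),
    ("Apply Special Instructions (Technical)", "High", "High", "INCREASE"),
]

for name, frequency, amplitude, effect in rows:
    print(name, rough_variability_score(frequency, amplitude, effect))
```

Under these assumptions the three rows give 12, 9 and 27, matching the tabulated scores; any other weighting that fits the same values would serve equally well.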
50 | Technological | Aircraft Structure | Pin did not interface correctly with structure
57 | Human | Handover | 100% tool check not complete at handover; pin location not specified
58 | Human | Supervise Maintenance | Supervision not close enough to spot errors
64 | Organisational | Operate Shift Pattern | Shift pattern broke task into fractured elements causing discontinuity
Figure 6-9 Visualisation Tool Output for Rigging Tool Occurrence [2]
[2] Note that a Visio software bug means that some connections become ‘un-glued’ when copying and pasting as images; this results in some lines being erroneously pasted into the corner of the drawing. On-screen performance is not affected.
[Figure not reproduced. Labels recovered from the drawing include, among others: Licensed Hangar/Parking Space; Force Level 0 Plan; Crew Training Plan; Squadron Planning Staff; LITS Instructions; Strategic Fleet Plan; ATTAC Contract (BAE Systems); ROCET Contract (Rolls Royce); 4000 and 5000 Series Regulatory Articles; Tornado Equipment Safety Management Plan; Flight Authorisation Process; AP100B-01 Handover Policy; Reporting / Just Culture.]
Figure 6-10 Instantiation of Rigging Pin Occurrence
[Figure not reproduced. Functions shown, each drawn as a FRAM hexagon with aspects O, C, P, I, T, R: 3 Month Flying Programme (4); Task Maintenance (5); Record Work done on Aircraft (6); Train Maintenance Personnel (7); Maintenance Personnel (8); Tools & Test Equipment (13); Force & A4 Operations (21); Publish Approved Data (30); Corrective Maintenance (43); Mechanical Systems (49); Handover (57); Supervise Maintenance (58); Plan Weekly/Daily Flying Programme (60); Operate Shift Pattern (64). Legend: Excess Output Variability; Inadequate Damping; Direct Interface with Aircraft; Aircraft Systems; Harmful Variability; Variable Activity.]
Notes: Red lines show activities with unacceptable variability. Black lines show other activities recorded or inferred as having significant variability in the Occurrence Investigations. Other parts of the system are not shown for clarity. Variability could be traced back to other functions not currently shown, with further investigation.
6.2.10 Insights from TASM
The TASM again shows this incident as a control problem; Figure 6-10 provides
an instantiation of the TASM. It shows in red a number of activities which linked
functions and caused output variability to permeate downstream through the
system. Other activities are shown where they are mentioned in the RAF
investigation. Many of these activities could have exerted more control over the
functions that produced variable output. In some cases a complete control loop
was missing: despite the practice of CBs not being ‘safety-tagged’ when in the
pulled condition, there was no quality control over this practice. The quality
function is shown as having an unlinked activity, which for the purpose of this
instantiation meant that there was no output. The main variability in question in
this occurrence was that surrounding the performance of tool control processes
and the way the corrective maintenance was conducted on the mechanical
system. There was a potentially harmful variability from the corrective
maintenance function in that the rigging pin was left in the system, which
resonated with the way that the mechanical system performed under the
functional test – passing with the pin in place. If the whole system had been
operating within acceptable bounds of control then the supervision and tool
control function would have provided further damping through checks and
adjustments on the way that the scheduled maintenance was conducted. It was
serendipity rather than ‘design for resilience’ that provided a warning that
there was a tool control issue before the aircraft was released in an
unairworthy condition. A source of variability that is shown to permeate through
the system was the operational plan to divert resources to Operation ELLAMY.
The flying and shift programming functions did not adjust their outputs to
compensate adequately and neither did the maintenance tasking function. A
potential damping mechanism was therefore lost. The DASOR included an error
management investigation which focussed on the organisational and human
failings in the scenario. The benefit in using FRAM is that it shows how all of
these aspects of the situation are linked.
7 USING THE TORNADO AIRWORTHINESS SYSTEM
MODEL FOR RISK ANALYSIS
The use of FRAM for risk analysis has been the subject of discussion amongst
researchers and an accepted form of practice has not yet been developed. This
chapter seeks to contribute to this development process. The FRAM attempts to
provide a more complete solution for managing risks in complex systems in
comparison to other more linear methods. In this chapter the existing risk
management process is described along with the current theoretical basis for
risk management. A new theoretical basis is proposed and then a new risk
assessment process is given. This is followed by a detailed example. The new
theoretical basis and assessment technique are then combined to give a new
approach to risk management.
7.1 Case for Using TASM for Risk Analysis
Airworthiness (or ‘equipment safety’) risk management systems employed by
the MOD in relation to Tornado include the construction and management of a
Weapon System Safety Case, which uses Goal Structuring Notation (GSN). Part
of the safety case argument (Manson, 2001) is the requirement to manage
equipment safety risks to within the MAA’s targets for Risk of Death from All
Causes as required by RA1210 (MAA, 2012a). This is achieved through the use
of an Equipment Hazard Management Process (LI-BS0056) summarised in
Figure 7-1 (MOD, 2013a). A Fault Tree Loss Model is used to aid this process
of assessment. The Air Safety Duty Holder then maintains a platform risk
register, which forms part of the overall Operational Safety Case. Bow tie
models are beginning to be used to aid the assessment of operating safety
risks. All of these models are based on the assumption that safety is a resultant
property of the aggregated activities, arguments or elements of the model or
safety case. The resilience engineering understanding of safety is that it is
equivalent to its converse condition (an accident) and both of these system
states are emergent properties. The purpose of developing the TASM for risk
assessment is to provide a resilience engineering view of safety risks as a more
realistic contrast to linear methods such as bow tie, which have the potential to
produce a false level of accuracy when applied to a human-centred
system such as the Tornado airworthiness system. A resilience engineering
perspective may produce either a more positive or a more negative view of a
particular risk but, depending on the level of complexity that applies to the
process under consideration, it is unlikely to produce a quantified assessment. To
achieve quantification, Bayesian or fuzzy logic principles are required. It should
be noted of course that the risk assessment process is an implied part of the
‘cost-benefit analysis’ function carried out by the Tornado Engineering Authority.
Figure 7-1 Tornado Process for Emergent Airworthiness Issues (MOD, 2013)
With regard to safety critical complex systems in general, isolating individual
potential risks is challenging due to the interconnected nature of such systems.
Initially FRAM analysis will provide a Resilience Engineering assessment of
risks that have already been identified within a system. It may allow a more
realistic understanding of the nature of particular hazards and how they may be
avoided or mitigated. Typically, hazard logs and risk registers record isolated
hazards/risks; FRAM has the potential to describe more accurately how both
hazards and mitigating (or damping) factors are linked. Many accident reports
detail seemingly unlikely combinations of unfortunate circumstances; FRAM
seeks to deal more effectively with harmful combinations of varying functional
output. This provides a novel approach to assessing Common Cause
failures during airworthiness assessments, particularly with respect to
maintenance or design ‘error’. As the approach develops, it may be possible to
identify the potential for previously unexpected risks to emerge. Risk analysis
techniques need to work from a theoretical basis for the origin of risk.
7.2 Current Theoretical Basis for Airworthiness Risk
Management
The current theoretical basis for managing Tornado airworthiness risk is shown
in Figure 7-2, which is adapted from the Hazard log structure illustrated in Local
Instruction BS0056 (MOD, 2013a). The overall risk to life from a particular
accident scenario is calculated by means of adjusting the historical reliability or
event rate data within the fault tree loss model to reflect the new issue
identified. Alternatively a qualitative engineering judgement based assessment
using broad likelihood and consequence categories may be employed. Figure
7-2 does not illustrate a process; the arrows indicate the aggregation of risk.
The current theory, as illustrated, starts with various system controls which
prevent accident causes emerging, which in turn lead to hazards which are then
subject to additional controls. Potential accident scenarios may develop
dependent on the likelihood of the preceding elements in the chain. These
potential accidents may develop through a series of events, some of which may
prevent the situation developing into an actual accident. The likelihood and the
severity of the accident should it develop is based on an arithmetic or qualitative
aggregation of all the preceding elements in the chain. This provides an
estimate for risk to life for a particular scenario which is then managed
(including changing elements in the chain) in the manner outlined in Figure 7-1.
The current theoretical basis implies that unless explicitly connected in some
way, adjustments to the various controls will provide a resultant increase or
decrease in the risk to life attributable to a particular potential accident scenario.
The advantage of the current theory is that it provides a basis from which risks
can be separated, considered in isolation and managed in an auditable manner.
An overall quantitative or qualitative estimation of risk to life is based on a linear
aggregation of estimated probabilities of hazards occurring and the
effectiveness of the various mitigating pre- or post-accident controls. Historical
data in the form of a loss model is used alongside data relating to specific
issues, such as failure rate data from inspections or tests (MOD, 2013). The
combination of loss models, hazard logs, and risk registers is used to model
the system on the basis of the theory shown in Figure 7-2.
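The linear aggregation described above can be illustrated with a short sketch. The function name and all numerical values below are hypothetical and are not taken from the Tornado loss model; the sketch only shows how, under the current theory, the probabilities along a single chain multiply together, so that adjusting one control changes the resultant risk proportionally.

```python
from math import isclose, prod


def linear_risk_to_life(cause_rate, control_failure_probs, p_fatal_given_accident):
    """Aggregate one accident chain arithmetically.

    cause_rate: rate at which the initiating cause arises (hypothetical units,
        e.g. per flying hour).
    control_failure_probs: probability that each control in the chain fails.
    p_fatal_given_accident: probability of a fatality if the accident occurs.
    """
    return cause_rate * prod(control_failure_probs) * p_fatal_given_accident


# Hypothetical figures for illustration only.
baseline = linear_risk_to_life(1e-3, [0.1, 0.05, 0.5], 0.2)

# Halving the failure probability of the second control halves the result,
# because each element of the chain is treated as independent.
improved = linear_risk_to_life(1e-3, [0.1, 0.025, 0.5], 0.2)

print(baseline, improved)
```

The independence assumption built into the multiplication is the feature that the resilience engineering perspective questions.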
7.3 Proposal of FRAM Based Airworthiness Risk Theory
Resilience engineering theory, and FRAM in particular, propose that accidents
are an emergent result of system performance and that accidents themselves are
mitigated or prevented by the system behaviour once a hazard has begun to
emerge. In the TASM, potential accident sequences [3] may be modelled in very
broad terms through the ‘operate aircraft’ and the various aircraft subsystem
technological functions. Figure 7-3 proposes an alternative or complementary
theoretical model to the current theory shown in Figure 7-2 for the derivation of
risks to life. In this case it is not possible to directly calculate quantitative risks
without some means of describing the TASM itself in quantitative terms, through
some numerical or algebraic calculation of model behaviour. As the TASM
already contains some qualitative descriptions of performance, it should also be
possible to provide a qualitative output in terms of risk. The risk to life may only
be calculated by estimating the likelihood of the system developing into a
functionally resonant state (or states) where an accident is generated.
[3] With TASM an accident sequence starts when an aircraft is operating hazardously. It does not
refer to the way the airworthiness management system is behaving at any particular time.
Figure 7-2 Current Theoretical Basis for Tornado Airworthiness Risk Management
[Figure not reproduced. Elements shown: Cause; Hazard; Potential Accidents; Accident; Risk to Life; with Controls and Controls & Events acting between them.]
Figure 7-3 Proposed Functional Resonance Risk Management Theory -
Visualisation of a Generic Hazardous Process
[Figure not reproduced. Elements shown, each function drawn as a FRAM hexagon with aspects O, C, P, I, T, R: Upstream Background Function; Upstream Forcing Function (two); Upstream Damping Function (two); External Dependency; Hazard Generating Function; Downstream Aircraft System or Operating Function; Upstream/Downstream Damping Function; and the labels HAZARD and ACCIDENT.]
The theory shown in Figure 7-3 makes it difficult to extract a meaningful
description of any risk from the system, as this risk is the product of both the
damping and varying performance of a function. The literature does not provide
examples of how this can be done using the existing FRAM; a bespoke
technique has therefore been developed. For an in-service system such as
Tornado, many risks are currently recorded and it is likely other potential risks
remain unrecorded. Such unknown risks may become apparent through close
examination and experimentation with the TASM. The reassessment of
currently recorded risk is easier and is the focus of this initial study. Clearly
airworthiness risk to life will only manifest itself through unacceptable variation
in the output of one of the FRAM functions relating to a physical element of the
aircraft. For example, the mechanical system might prevent proper control of
the aircraft or the structure may fail to react loads through a loss of structural
integrity. Defining an associated risk likelihood element depends on the
upstream performance variability of the system. Likelihood of hazardous
variability will also depend on the effectiveness of upstream/downstream
functions in providing damping against harmful variability. Where this
upstream/downstream damping fails to control the hazardous variability,
functional resonance occurs. In order to examine the risk associated with a
particular hazard, the following terms are defined:
Hazardous Process – All functions and activities contributing to the
hazard generation.
Hazard Generating Function – The aircraft system function whose
variable output directly generates a hazardous condition.
Upstream Forcing Function – A function whose output contributes to
forcing the generation of a hazardously variable output from a
downstream function.
Upstream Damping Function – A function whose output reduces the
variability of a downstream Hazard Generating Function.
Background Function – A function whose output can be assumed not
to vary to any significant degree and does not contribute to the variability
of the hazard generating function.
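The terms defined above can be captured in a small data structure. The following is a minimal sketch rather than part of the TASM itself: the class and function names are hypothetical, the roles assigned to each function are purely illustrative, and it is assumed here that a hazardous process comprises every function that is not classed as background.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    HAZARD_GENERATING = "hazard generating"
    FORCING = "upstream forcing"
    DAMPING = "upstream damping"
    BACKGROUND = "background"


@dataclass(frozen=True)
class FramFunction:
    number: int  # function number as used in the TASM
    name: str
    role: Role   # role relative to the hazard under consideration


def hazardous_process(functions):
    """Return the functions taken to contribute to the hazard.

    Assumption: background functions are excluded; forcing, damping and
    hazard generating functions are all retained.
    """
    return [f for f in functions if f.role is not Role.BACKGROUND]


# Illustrative role assignments only.
functions = [
    FramFunction(50, "Aircraft Structure", Role.HAZARD_GENERATING),
    FramFunction(43, "Corrective Maintenance", Role.FORCING),
    FramFunction(58, "Supervise Maintenance", Role.DAMPING),
    FramFunction(30, "Publish Approved Data", Role.BACKGROUND),
]

for f in hazardous_process(functions):
    print(f.number, f.name, f.role.value)
```

A role applies only to the hazard under consideration; the same function could be a damping function for one hazard and a forcing function for another.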
A hazardous process and hence a hazard can emerge in any part of the
system. An airworthiness related accident can only occur as a result of one or
more hazardous processes producing uncontrollable variable output from
aircraft system functions (structure fails, electrical fire, loss of power etc.). In the
majority of cases this would be deemed a technical failure in the existing
Tornado air safety risk management process. For example, a hazard might be
corrosion of the aircraft structure. An accident relating to this hazard would
involve the loss of structural integrity as a result of an out of limits variability in
the react-loads output from the aircraft structure function. Corrosion can be
considered to be largely due to internal variability as it is related to the material
of the structure. This is of course based on the assumption that the environment
is not a function within the TASM. If there is significant variability in the
environment experienced across the fleet, then this should be mapped as a new
function in the TASM. Corrosion may also be caused by external variability from
other functions – damage inflicted during maintenance or contamination from
other aircraft systems for example. These would be upstream forcing functions
and could link together in a hazardous process. Damping of this negative
variability is provided by a series of control processes acting on the aircraft
structure function, for example structural inspections called up during scheduled
maintenance and specified by the Engineering Authority in the approved data as
a result of a cost benefit analysis. Airworthiness related issues could also play a
part in generating accidents by providing an upstream forcing function for the
‘operate aircraft’ function whilst still providing an output that varies
within the bounds of the system specification. For example, the output of the
avionics flight system may provide an accurate but potentially confusing signal
to aircrew, which when combined with the internal variability of the ‘operate
aircraft’ function may result in an accident. Test and Evaluation is intended to
identify such hazardous variability.
7.4 Proposal for a FRAM Based Risk Assessment Process
As already discussed, quantitative risk assessment is difficult using the FRAM.
It would be possible to apply probabilistic data to the outputs of various
functions in the TASM; however, without using fuzzy or Bayesian
techniques it is not possible to model how these probabilities aggregate through
any particular process. To do so would require assumptions to be made that all
functions not considered are background functions and will not vary in output as
a result of the variability in the process under consideration. An inspection of the
TASM shows that the level of connectivity between all functions means that any
such assumption is of dubious validity. Figure 7-4 shows the risk assessment
methodology developed for this project. This iterative process expands through
the TASM allowing all functions in the hazardous process to be highlighted and
then the hazardous output variability to be re-assessed based on the forcing
and damping functions in play.
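The iterative expansion through the model can be sketched as a breadth-first walk upstream from the hazard generating function. This is an illustration under stated assumptions, not the visualisation tool's own logic: the function name, the links between functions and the role assigned to each function are hypothetical, and it is assumed that expansion stops at any function judged to be a background function.

```python
from collections import deque


def trace_hazardous_process(hazard_fn, upstream, classify):
    """Walk upstream from hazard_fn, recording the role of each function met.

    upstream: dict mapping a function name to the names of the functions
        feeding it.
    classify: callable returning "forcing", "damping" or "background".
    Assumption: the walk does not expand beyond background functions.
    """
    roles = {}
    visited = {hazard_fn}
    queue = deque([hazard_fn])
    while queue:
        current = queue.popleft()
        for up in upstream.get(current, []):
            if up in visited:
                continue
            visited.add(up)
            role = classify(up)
            roles[up] = role
            if role != "background":
                queue.append(up)
    return roles


# Hypothetical links and judgements for a structural corrosion hazard.
upstream = {
    "Aircraft Structure": ["Corrective Maintenance", "Scheduled Maintenance"],
    "Corrective Maintenance": ["Supervise Maintenance", "Publish Approved Data"],
    "Scheduled Maintenance": ["Publish Approved Data"],
    "Publish Approved Data": ["Train Maintenance Personnel"],
}
assessment = {
    "Corrective Maintenance": "forcing",
    "Scheduled Maintenance": "damping",
    "Supervise Maintenance": "damping",
    "Publish Approved Data": "background",
    "Train Maintenance Personnel": "forcing",
}

roles = trace_hazardous_process("Aircraft Structure", upstream, assessment.get)
print(roles)
```

In this example the walk never reaches Train Maintenance Personnel, because it lies behind a function judged to be background; this is exactly the kind of assumption that the connectivity of the TASM calls into question.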
Figure 7-4 FRAM Model Risk Assessment Process
[Flowchart not reproduced. Steps recovered: Identify Hazard; Identify the initial/next Hazard Generating Function and select in Visualisation Tool; Select Upstream Function; Is Upstream Function Forcing, Damping or Background?]
highlights how the Continuing Airworthiness Management Organisation (CAMO)
tasks have been distributed across a variety of organisational boundaries. In
particular many of the tasks have been contracted to industry but the MOD
organisation responsible for these contracts works for the TAA. Whilst the
background layer in the TASM broadly shows areas of responsibility, CAMO
tasks do in fact stretch across most areas of the TASM. Table 8-2 shows how
the TASM will assist the CAMO in undertaking its tasks. The CAM has
responsibility for more dynamic elements of the system compared to steadier
state activities undertaken by TAA staff. Modelling this system will at least
promote greater understanding of how the various components interact. Whereas
the TASM is of more use to the TAA as a risk assessment tool, to the CAMO it is
more useful as a general management tool. It may provide a useful check
on future system changes, allowing analysis of downstream implications of any
change prior to implementation. It could be argued that a knowledgeable and
experienced manager would have an instinctive handle on the implications of
change without use of the tool. The modelling process has shown that a high level
of complexity exists, so it is moot whether an intuitive decision making
process would be able to consider all factors in as complete a manner without a
model. The scope of management understanding of the control mechanisms
available to adjust the system is currently limited by experience.
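The check on downstream implications of a change can be sketched as a breadth-first walk through the links of the model, recording how many steps each affected function lies from the changed function. The function name and the links below are hypothetical and serve only to illustrate how first, second and third order implications might be listed.

```python
from collections import deque


def downstream_orders(changed_fn, downstream):
    """Return {function: order} for every function reachable from changed_fn.

    downstream: dict mapping a function name to the functions its output feeds.
    Order 1 is a direct link, order 2 is one function removed, and so on.
    """
    order = {changed_fn: 0}
    queue = deque([changed_fn])
    while queue:
        current = queue.popleft()
        for nxt in downstream.get(current, []):
            if nxt not in order:
                order[nxt] = order[current] + 1
                queue.append(nxt)
    del order[changed_fn]  # the changed function itself is not an implication
    return order


# Hypothetical links for illustration only.
downstream = {
    "Operate Shift Pattern": ["Handover", "Supervise Maintenance"],
    "Handover": ["Corrective Maintenance"],
    "Supervise Maintenance": ["Corrective Maintenance"],
    "Corrective Maintenance": ["Mechanical Systems"],
}

implications = downstream_orders("Operate Shift Pattern", downstream)
print(implications)
```

Such a listing only shows which functions could be affected; judging whether the variability is amplified or damped along each link still relies on the qualitative data held in the model.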
Table 8-2 Potential CAMO Use of TASM
Ref | RA 4947 CAMO Tasks | Use of TASM Version 1 | Potential Future Uses
a | Develop and control a maintenance programme including any applicable reliability programme, proposing amendments and additions to the maintenance schedule to the TAA. | Qualitative understanding of the airworthiness system to understand any second and third order implications of changes to the maintenance programme. | Integrate reliability data into TASM to provide enhanced analysis of organisational or human causes for repeat arisings.
b | Manage the embodiment of modifications and repairs. | Assessment of the effect of tasking modification and repairs by generating an instantiation of the system. | Incorporate reporting of modification satisfaction into the TASM to provide feedback as to the success of the plan.
c | Ensure that all maintenance is carried out to the required standard and in accordance with the maintenance programme, and released in accordance with MRP Maintenance Certification Regulation. | Qualitative understanding as to whether the system as currently constructed is capable of reliably implementing the maintenance programme - for example assessing the resources required for variations in the programme. | Incorporation of resource data (e.g. manning spreadsheets) into the model.
d | Ensure that all applicable SI(T)s are applied. | Assessment of the system's ability to reliably carry out SI(T)s without excessive variability; understand all of the upstream dependencies on the SI(T) functions. | Incorporate SI(T) satisfaction data into the TASM. Better predictive capability for SI(T) satisfaction rate.
e | Ensure that all faults reported, or those discovered during scheduled maintenance, are managed correctly by a Military Maintenance Organization or MRP/Mil Part 145 Approved Maintenance Organization. | Analysis of the maintenance organisation's likelihood of managing faults correctly. | Use predictive data to provide indications of when organisations may fail.
f | Co-ordinate scheduled maintenance, the application of SI(T)s and the replacement of service life limited parts. | Provides a top level map of these activities. | Leading performance indicators to provide warning of potential failures.
g | Manage and archive all continuing airworthiness records and the MF700/operator's technical log. | N/A | N/A
h | Ensure that the weight and moment statement reflects the current status of the aircraft. | N/A | N/A
i | Initiate and coordinate any necessary actions and follow up activity highlighted by an occurrence report. | Allows more effective and more easily implemented recommendations to be generated from occurrence reports. | Incorporation of ASIMS data into the TASM.
8.7 Utility of TASM for Duty Holder Activity
The TASM was designed as an airworthiness management tool rather than an
air safety tool. Nonetheless, it will provide a useful facility for the Duty Holder
and his staff engaged in air safety management; principally it will give
non-airworthiness staff a better overview and understanding of the airworthiness
system and thus allow greater challenge of the advice of specialists.
Table 8-3 Aviation Duty Holder Use of TASM
Ref | RA 1020 Roles & Responsibilities: Aviation Duty Holder | Use of TASM Version 1 | Potential Future Uses
a | Cease routine aviation operations if RtL are identified that are not demonstrably at least Tolerable and ALARP. | More effective assessment of risks using a resilience point of view. | Potential for quantification of risk analyses.
b | Establish and maintain an effective ASMS that, wherever possible, exploits the MOD’s existing aviation regulatory structures, publications and management practices, in order to demonstrate an acceptable means of compliance with the requirements in RA1200. | Evolve ASMS into a resilience engineering based system. | Introduce leading indicators to increase effectiveness of the ASMS - incorporated into the model.
c | Promote and lead by example a questioning Air Safety culture. | Provide DH with an overview of the system to allow more effective questioning of the CAM and the TAA. | Hold TAA and the CAM to account using quantitative performance indicators.
d | If necessary, challenge formally any option or action that is proposed or implemented by DH-facing organizations that may result in the activities for which they are responsible not being Tolerable and ALARP. | Provide a ready model which can be used to demonstrate the effect of DH facing organisations on airworthiness. | Quantify the effect of DH facing actions, using integrated data.
The duty holder is responsible for ‘holding’ the risk to life as a result of any
airworthiness issue that may be present in the system or may arise in the future.
For all of the reasons described in Chapter 2, this responsibility can be
discharged with greater realism if these risks are analysed from a resilience rather
than a traditional linear perspective. It is anticipated that the lack of clarity over
risk may cause significant concern, and this risk assessment element is the key
area for further development. However, there is a concern that linear
methods currently provide a spurious level of accuracy in their risk modelling;
duty holders may need to accept a greater degree of uncertainty around these
estimates. The difficulty is that the aim of the duty holder in dealing with any
airworthiness related risk to life is to ensure that risk has been reduced to
ALARP and at least within tolerable bounds. Greater uncertainty over
categorisation of risk to life may lead to greater conservatism in dealing with
emerging risks. Whilst this may be beneficial for safety it would be
disadvantageous from an operational perspective. To counter this problem it
should be emphasised that the model shows, in greater detail than has been
described before, the multiple layers (or loops) of control that provide damping
against harmful activity within the system. A specific insight that was gained
from the model is the degree of connectivity of a few specific functions. These
functions are highlighted visually within the model. It would be possible to
conduct a similar exercise for flight safety using a FRAM model and perhaps to
link the two models together.
8.8 Potential Use for System Improvement
The literature review outlined successes for using FRAM in process
improvement activity in other industries. Experience has shown that the RAF
has discontinued the use of a variety of ‘lean’ methodologies in the years
following their initial introduction. These linear methodologies have often failed
to deal adequately with either the complexity or the variability of aircraft
maintenance operations. Criticisms have been levelled that lean sought to
impose production line methodologies on maintenance, which was a poor fit
and, as Carney (2010) found, disadvantageous for safety. The FRAM
has great potential to provide an alternative or complementary means of
achieving process improvement. For example, the TASM highlights the
variability in the way that maintenance is tasked; sometimes tasking is
generated through handovers, sometimes it is written on boards and sometimes
tasks are given verbally. A more detailed mapping of this element of the TASM
using FRAM, worked through in a facilitated workshop with a variety of
personnel involved in the activity day-to-day, may assist in developing a more
efficient and safer process. This type of system improvement workshop ought to
become the de facto response to any occurrence investigation. The current
system employed by the RAF provides for a hierarchical review and
implementation of recommendations made by occurrence investigators and
review groups. Whilst this accords with the need to vest decision making with
those ultimately responsible for the risk to life, it does remove decision making
further from those who may be best placed to understand the complexities of
the system. Further work is required on how best to integrate FRAM into
decision making on occurrence report recommendations.
8.9 Potential for Further Development of the TASM
This project sought to test whether the FRAM could be applied to airworthiness
management and how useful a tool could be created from it. The version that is
presented in this report is an initial baseline version and, whilst it ought to prove
a useful tool in its own right, there is much scope for further development.
Currently the model exists as two files: one a spreadsheet and the other a Visio
drawing which can be manipulated interactively using the layers feature. It is
possible to embed Visio drawings within Microsoft SharePoint sites as used
within the military IT systems (MOSS). It is also possible to attach data files
drawn from Excel to shapes within Visio drawings. Figure 8-2 shows a potential
development pathway for the TASM. The envisaged end state for development
is a tool hosted on standard desktop IT, providing a ‘dashboard’ type function to
show how the whole system is performing. It should be able to display output
from a variety of data sources such as LITS reports, manning information
spreadsheets, ASIMS data and quality audit reports.
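As a rough sketch of how such a ‘dashboard’ might roll indicator data up to the functions of the model, the example below attaches hypothetical indicators to functions and reports the worst status for each. The `Indicator` structure, the threshold scheme, the readings and the pairing of data sources with functions are all assumptions made for illustration; none of them is taken from the TASM spreadsheet.

```python
from dataclasses import dataclass

STATUS_ORDER = {"green": 0, "amber": 1, "red": 2}


@dataclass(frozen=True)
class Indicator:
    function: str  # FRAM function the indicator is attached to
    source: str    # data source, e.g. LITS, ASIMS, manning, quality audit
    value: float   # latest reading (hypothetical units)
    amber: float   # reading at or above which the status is amber
    red: float     # reading at or above which the status is red

    def status(self) -> str:
        if self.value >= self.red:
            return "red"
        if self.value >= self.amber:
            return "amber"
        return "green"


def roll_up(indicators):
    """Return the worst indicator status recorded against each function."""
    worst = {}
    for ind in indicators:
        current = worst.get(ind.function, "green")
        if STATUS_ORDER[ind.status()] >= STATUS_ORDER[current]:
            worst[ind.function] = ind.status()
    return worst


# Invented readings and thresholds, for illustration only.
INDICATORS = [
    Indicator("Flight Servicing", "ASIMS", 3, 2, 5),
    Indicator("Flight Servicing", "Quality audit", 1, 2, 4),
    Indicator("Record Work Done", "LITS", 12, 5, 10),
    Indicator("Scheduled Maintenance", "Manning spreadsheet", 0.0, 0.1, 0.2),
]

if __name__ == "__main__":
    for function, status in roll_up(INDICATORS).items():
        print(f"{function}: {status}")
```

Taking the worst status per function is one simple design choice; a real tool might instead weight indicators or show trends over time.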
Figure 8-2 TASM Development Pathway
[Figure text: extract from the TASM spreadsheet (Tornado GR4 Airworthiness System Model, Functional Resonance Analysis Method) for function 1, ‘Flight Servicing’ (human function). Input: Line Controller indicates task on boards. Outputs: AC systems replenished; AC visually inspected; any faults recorded; husbandry jobs recorded in log; Flight Servicing Certificate signed.]
8.9.1 Increased Model Fidelity
Throughout the development of the model to date there has been a continuous
set of assumptions and simplifications made regarding the operation of the real
world system. Chapter one discussed the nature of complexity and its inherent
incompressibility. With this in mind it is important to remember that the model
can only provide a rough description of what is happening within the system or
how it is likely to behave in any future scenarios. The only way that the fidelity of
the model can be improved is to exercise it against various scenarios and make
iterative adjustments. This process would need careful configuration control as
is currently exercised for the loss model and for risk registers.
8.9.2 Application of Bayesian and/or Fuzzy Logic
The inability to generate simple risk assessments is likely to be seen by users
as a key weakness of the FRAM approach. Whilst this level of uncertainty is
potentially a more realistic reflection of the risk, it would be useful to be able
to assess risk more accurately in order that different courses of action can be
compared. For example, one solution to the configuration management issues
highlighted in chapter seven would be to further automate the data capture
process, perhaps with portable devices that could be used alongside the
aircraft. This would remove some of the higher levels of variability that are
provided by human elements of functions such as ‘scheduled maintenance’ and
‘record work done’. It would however require a substantial investment to
introduce such a capability. This would require a business case to allow public
funds to be committed. Existing processes (MOD, 2013) require quantitative risk
assessment to achieve this and generally use ‘waterfall diagrams’ to
demonstrate how risk is mitigated over time. There is therefore a clear
requirement to introduce some further elements of quantification into the FRAM
and the TASM. The most promising approach at present is that
outlined by Slater (2013), who has developed a desktop interface to allow
development of Bayesian logic dependency diagrams. Slater’s tool does not
require a risk assessor to become competent themselves in Bayesian
mathematics. This approach uses FRAM as a framework on which Bayesian
decision nets can be constructed.
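To illustrate the kind of quantification that a Bayesian treatment could add, the sketch below considers a single coupling between two functions and compares two courses of action. It is not Slater's tool and makes no claim about how that tool works. The choice of ‘Record Work Done’ as the upstream function and ‘Replace Life Limited Parts’ as the downstream function is an assumption, and every probability is an invented placeholder rather than a value drawn from the TASM or from ASIMS data.

```python
# Hypothetical conditional probability table (illustrative values only):
# P('Replace Life Limited Parts' output is variable | state of 'Record Work Done').
CPT = {"variable": 0.40, "stable": 0.05}


def p_downstream_variable(p_upstream_variable, cpt=CPT):
    """Law of total probability over the two upstream states."""
    p_upstream_stable = 1.0 - p_upstream_variable
    return (cpt["variable"] * p_upstream_variable
            + cpt["stable"] * p_upstream_stable)


def p_upstream_given_downstream(p_upstream_variable, cpt=CPT):
    """Bayes' rule: P(upstream variable | downstream observed to be variable)."""
    joint = cpt["variable"] * p_upstream_variable
    return joint / p_downstream_variable(p_upstream_variable, cpt)


if __name__ == "__main__":
    # Assumed priors for the upstream function under two courses of action.
    options = {"manual data capture": 0.30, "automated data capture": 0.10}
    for label, prior in options.items():
        print(f"{label}: P(downstream variable) = "
              f"{p_downstream_variable(prior):.3f}")
```

With these assumed numbers the probability of variable downstream output falls from 0.155 to 0.085 when data capture is automated. A comparison of this form is the sort of figure a business case could cite, provided the underlying probabilities could be justified.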
8.9.3 Expansion into Operational Safety Management
The entirety of the system for operating Tornado is captured within the ‘Operate
Aircraft’ function. It is recorded in this way because the aircrew interact with the
aircraft systems and hence affect the airworthiness of the aircraft in the short
term during a particular sortie or over the long term as patterns of usage affect
the condition of the aircraft systems. Clearly operating the aircraft is a human
function and is also heavily involved in most air accidents. So whilst outside of
the scope of this particular study, there is likely to be significant benefit in
developing a FRAM model to understand flight safety elements of aircraft
operations. This could be linked to the TASM to understand how airworthiness
and flight safety risks are interlinked.
8.10 Chapter Summary
This discussion centred on the applicability of resilience engineering concepts
to the practice of airworthiness engineering. It discussed how the framework
under which airworthiness related safety investigations are conducted could be
adapted to the new ideas. The increase of both realism and uncertainty in risk
assessment using resilience engineering techniques was discussed. The
potential for future development of the TASM was also described.
9 CONCLUSIONS
9.1 Summary
The project first reviewed the literature on the general background theories to
safety science and engineering. Three broad themes were identified as having
reached maturity: the technological age, based on Boolean logic and
reliability studies; the age of human factors; and the age of the
organisational accident. More recent developments included a study of the
effects of complexity and control theory in order to understand safety. From
these roots resilience engineering has been highlighted by some as a new
paradigm in safety. The literature around resilience engineering was found to be
somewhat fractured; however, several key themes were identified
and discussed. Principally, a resilience engineering perspective views safety as
the system’s ability to perform under disturbed and potentially unexpected
conditions. Safety therefore becomes a control problem rather than a reliability
problem. Safety is an emergent property rather than a direct linear resultant
property determined by the reliability of the system’s components and their
mode of use. The Functional Resonance Analysis Method (FRAM) was
identified as having the greatest potential to operationalise these new theories
in existing airworthiness organisational systems. The methodology describes
the system in terms of its functions, linked by activities. Harmful activity may
emerge as a result of the non-linear combination of variable functional outputs,
occasionally causing functional resonance which may propagate in the form of
uncontrollable output variability across the system. Using FRAM a spreadsheet
model was developed for Tornado Airworthiness; alongside this an interactive
visualisation tool was also developed. The model was based on a number of
interviews with personnel within the airworthiness system and on data from the
MOD’s Air Safety Information Management System, alongside a large amount
of policy documentation. This Tornado Airworthiness System Model (TASM)
was tested by taking the results from two separate incidents and describing the
scenarios in terms of functional resonance. This identified that the model was
consistent with both scenarios but also raised various questions over the
assumptions behind the investigations. The TASM was also used to investigate
the risk posed by the operation of components in excess of their cleared life on
the Tornado. This analysis highlighted that the model in its current form was not
able to quantify the risk in anything other than very general terms. However the
model did illustrate how the various factors were responsible for either forcing or
damping the variability of the functional output of the ‘replace life limited parts’
function within the model. This method of analysing risk scenarios provides
additional insight that traditional reporting techniques do not. Resilience
engineering, and FRAM in particular, were shown to offer a great deal of
insight into how airworthiness may be more effectively managed. The research
objectives were:
Review the theoretical background to safety management and the
implications for airworthiness management.
Review the concepts of Resilience Engineering with an emphasis on
application to airworthiness management.
Establish a theoretical framework for a model of an airworthiness
management system.
Gather and use primary research data to establish and validate a model
of the airworthiness management system for the RAF Tornado Force.
Using the model, develop a tool to enhance the airworthiness
management system of the RAF Tornado Force.
All of these objectives were met and it can be concluded that the project has
produced an operationally useful tool which will enhance the management of
airworthiness across the RAF’s Tornado fleet using the latest safety thinking.
9.2 Recommendations
Whilst resilience engineering in general and the TASM in particular require
extensive development, the following specific recommendations are given with
respect to the RAF Tornado case study.
9.2.1 Manage Airworthiness as a Control Problem
Quantitative or probabilistic risk assessments are well suited to reliability
analysis of components or subsystems. Such analyses are of dubious validity
when considering complex systems and even more so where there is a large
element of human and organisational interaction. These cases apply to
airworthiness issues and as such it is better to combine reliability analyses with
a treatment of the achievement of airworthy systems as an ongoing control
problem.
9.2.2 Use the TASM to Control the Airworthiness System
Control of airworthiness systems can effectively be modelled by the Functional
Resonance Analysis Method, where harmful activity occurs when the output
from one function becomes coupled in a resonant manner with the aspect of
another function. The Tornado Airworthiness System Model (TASM) provides a
baseline model from which such analyses can be carried out.
The TASM will prove to be a powerful tool for occurrence investigation and
should be used as a baseline from which to conduct such investigations.
9.2.3 Review Airworthiness Risk from a Resilience Perspective
Where it is necessary for the Tornado TAA and Duty Holder to sentence
emerging air safety risks that have any connection to airworthiness
management, the TASM should be used to review the risk from a resilience
engineering perspective alongside existing methodologies required by the MAA.
9.2.4 Use FRAM as a Means to Improve System Resilience and
Efficiency
Where incidents occur, the TASM should be the baseline investigative tool.
Other quality and continuous improvement activity should use the TASM and
FRAM as a means to seek out improvements in safety and in efficiency across
organisations involved in airworthiness. In particular FRAM can be used as an
alternative to linear ‘lean’ techniques when dealing with complex working
environments.
9.3 Potential for Further Research and Development
This project has provided an initial look into resilience engineering with
respect to airworthiness. There is a large amount of further research that can be
conducted into this area. General themes should encompass:
The use of FRAM as a technique for investigating air accidents and air
safety occurrence reports.
The development of FRAM to produce quantitative and qualitative risk
assessments, particularly focussing on how it may be used as a
framework to develop Bayesian probability models.
Development of techniques, protocols and standards for conducting
FRAM workshops, whether for the analysis of safety issues or for the
purpose of improving quality or safety.
This project has created an initial version of the TASM which, while useful, will
require extensive further development:
Integration of the FRAM spreadsheet as data attached to the functional
shapes within the visualisation tool to allow easier interpretation.
Integration of the visualisation tool into a Microsoft SharePoint site to allow further
development as an airworthiness ‘dashboard’.
Development of the leading safety indicators identified within the TASM.
Linking of existing and new data sources as leading safety indicators in
a TASM ‘dashboard’ to provide a mechanism for day-to-day
management of the airworthiness system and enhance the ability of both
the CAM and the TAA to take appropriate airworthiness decisions.
9.4 Concluding Remarks
This study has taken a new set of safety science concepts and has sought to
apply them to the management of airworthiness. This activity has been largely
successful although inevitably there will need to be a further continuous process
of iteration and improvement to the tools produced. The background to this
project was the set of questions posed by the Nimrod Review. It is clear, in the
light of the new safety paradigm described by resilience engineering, that in the
case of Nimrod, the airworthiness system had gradually slipped out of control
and that a variety of functions had begun to resonate with each other resulting
eventually in uncontrollable interaction between the fuel, mechanical and
electrical systems to produce the catastrophic loss of the aircraft and crew.
Resilience engineering and FRAM provide a basis for more effective future
control of the organisational, technological and human functions involved in
airworthiness. Better upstream management of airworthiness controls will
prevent some future pilot having to “fight with the controls” in the face of some
potential downstream catastrophe.
REFERENCES
Anon, (2011) ‘Supervision High Up on the Equator - the Puma Force in Kenya’, Air Clues, July [Online], Available at: http://www.raf.mod.uk/rafcms/mediafiles/29D67908_5056_A318_A8AFDA410071E0B8.pdf (Accessed: 1 December 2013).
Aitken, H. (2009) LITS Business Data Corruption, MOD: Internal, DES/WYT/595441/4 20 May 09.
Apostolakis, G. E. (2004) ‘How useful is quantitative risk assessment?’, Risk Analysis, vol. 24, no. 3, pp. 515-520.
Bagwell, G. (2011) 1 Gp ODH ALARP Statement - Operation of Components in Excess of Cleared Life, MOD Internal RESTRICTED, TOR 01.
Beauchamp, E. (2006) ‘Learning from Diversity: Model-Based Evaluation of Opportunities for Process (Re)-Design and Increasing Company Resilience’, The Second Resilience Engineering Symposium, Antibes – Juan-Les-Pins, France 8-10 November 2006: Resilience Engineering Association, pp. 23.
Belmonte, F., Schön, W., Heurley, L. and Capel, R. (2011) ‘Interdisciplinary safety analysis of complex socio-technological systems based on the functional resonance accident model: An application to railway traffic supervision’, Reliability Engineering & System Safety, vol. 96, no. 2, pp. 237-249.
Bendat, J. S. (1998) Nonlinear system techniques and applications, New York: Wiley.
Brooker, P. (2011) ‘Experts, Bayesian Belief Networks, rare events and aviation risk estimates’, Safety Science, vol. 49, no. 8–9, pp. 1142-1155.
Cambon, J., Guarnieri, F. and Groeneweg, J. (2006) ‘Towards a new tool for measuring Safety Management Systems performance’, Learning from Diversity: The Second Resilience Engineering Symposium, Antibes – Juan-Les-Pins, France 8-10 November 2006: Resilience Engineering Association, pp. 53.
Carney, P. (2010) Critical Analysis of the Airworthiness Impact of Lean Production Principles in a Depth Maintenance Organisation. MSc thesis, Cranfield University.
Casey, T. (2013) Tornado Continuing Airworthiness Management Exposition (CAME) MOD Internal RESTRICTED, CAMO/CERT/2012/018.
Cilliers, P. (2005) ‘Knowing complex systems’, in Richardson, K. (ed.) Managing Organizational Complexity: Philosophy, Theory, and Application, Greenwich, CT: ISCE Publishing, pp. 7-19.
Cooke, P. (2004) Panavia Tornado GR4 [Online], Available at: http://www.airliners.net/photo/UK---Air/Panavia-Tornado-GR4/0636414/L/ (Accessed 5 March 2014).
Coury, B., Kolly, J., Gormley, E. and Dietz, A. (2008) ‘The central role of principal issues in aviation accident investigation’, Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 52, Sage Publications, pp. 99.
Crown Copyright (2009) 31 Squadron Tornado [Online], available at: http://www.raf.mod.uk/gallery/tornadogallery.cfm?start=1&viewmedia=4#pageContent (Accessed 5 March 2014).
de Carvalho, P. V. R. (2011) ‘The use of Functional Resonance Analysis Method (FRAM) in a mid-air collision to understand some characteristics of the air traffic management system resilience’, Reliability Engineering & System Safety, vol. 96, no. 11, pp. 1482-1498.
De Landre, J., Gibb, G. and Walters, N. (2006) ‘Using Incident Investigation Tools Proactively for Incident Prevention’, Meeting of the Australian and New Zealand Society of Air Safety Investigators. Australia: Australian and New Zealand Society of Air Safety Investigators [Online]. Available at: http://asasi.org/papers.htm (Accessed 13 November 2013).
Dekker, S. (2003) ‘When human error becomes a crime’, Human Factors and Aerospace Safety, vol. 3, pp. 83-92.
Dekker, S. (2005) ‘Why we need new accident models’, Contemporary issues in human factors and aviation safety, pp. 181.
Dekker, S., Cilliers, P. and Hofmeyr, J. (2011) ‘The complexity of failure: Implications of complexity theory for safety investigations’, Safety Science, vol. 49, no. 6, pp. 939-945.
Dudman, D. (2012) ‘No 1 Group Air Safety Management Plan’, 3rd ed., Royal Air Force Internal, Defence Intranet.
Edwards, J. R. D., Davey, J. and Armstrong, K. (2013) ‘Returning to the roots of culture: A review and re-conceptualisation of safety culture’, Safety Science, vol. 55, no. 0, pp. 70-80.
Espejo, R. (1989) ‘A cybernetic method to study organizations’, The Viable System Model: Interpretations and Applications of Stafford Beer’s VSM, pp. 361-382.
Freed and Priday, R. (2008) ‘Annex A to BP 1301 - Initial Report of Serious Occurrence or Fault’, MOD Internal.
Gale, I., Keeling, A. and Strasdin, S. (2013) ‘Perfect Storm’, Air Clues, July [Online], Available at: http://www.raf.mod.uk/rafcms/mediafiles/3AE4263C_5056_A318_A883FF5D10B24E91.pdf
Grøtan, T. O., Størseth, F. and Albrechtsen, E. (2011) ‘Scientific foundations of addressing risk in complex and dynamic environments’, Reliability Engineering & System Safety, vol. 96, no. 6, pp. 706-712.
Haddon-Cave, C. (2009) The Nimrod Review, London: The Stationery Office.
Hale, A. and Borys, D. (2013) ‘Working to rule or working safely? Part 2: The management of safety rules and procedures’, Safety Science, vol. 55, no. 0, pp. 222-231.
Heinrich, H. W., Petersen, D. and Roos, N. (1950) Industrial accident prevention, McGraw-Hill:New York.
Herrera, I. (2012) Proactive safety performance indicators. PhD thesis Norges teknisk-naturvitenskapelige universitet, Institutt for produksjons- og kvalitetsteknikk [Online]. Available at: http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-16990.
Hitchens, D. (2003) Advanced Systems Thinking, Engineering and Management, 1st ed, Artech House: Norwood.
Hodson, C. J. (2008) Civil Airworthiness for a UAV Control Station. MSc thesis. University of York [Online]. Available at: http://www-users.cs.york.ac.uk/~mark/projects/cjh507_project.pdf
Hollnagel, E. (2011) Resilience engineering in practice: A guidebook, Farnham, Surrey: Ashgate Publishing.
Hollnagel, E. (2012) FRAM: The Functional Resonance Analysis Method Modelling Complex Socio-Technical Systems, Farnham, Surrey: Ashgate Publishing.
Hollnagel, E. (2014) The Functional Resonance Analysis Method, 20 March [Online] Available at: www.functionalresonance.com.
Hollnagel, E. and Woods, D. (2005) Joint cognitive systems: Foundations of cognitive systems engineering, NW: CRC Press.
Hollnagel, E., Woods, D. and Leveson, N. (2007) Resilience Engineering Concepts and Precepts, Farnham, Surrey: Ashgate Publishing.
Hounsgaard, J. (2013) Using FRAM as a Quality Improvement Tool in Health Care [Online], available at: http://functionalresonance.com/onewebmedia/FRAMily_2013_Hounsgaard.pdf
ICAO (2001) Annex 13 to the Convention on International Civil Aviation - Aircraft Accident and Incident Investigation, 9th ed., ICAO [Online]. Available at: http://www.cad.gov.rs/docs/udesi/an13_cons.pdf.
Jeffery, D. (2009) Tornado Configuration Control and Impact on Continued Airworthiness, QinetiQ RESTRICTED, QINETIQ/MS/SES/CR0902379/1.
Johansson, B. and Lindgren, M. (2008) ‘A quick and dirty evaluation of resilience enhancing properties in safety critical systems’, Proceedings of the third symposium on resilience engineering, Juan-les-Pins, France, pp. 133.
Johnson, C. and Holloway, C. (2004) ‘On the over-emphasis of human ‘error’ as a cause of aviation accidents: ‘systemic failures’ and ‘human error’ in US NTSB and Canadian TSB aviation reports 1996–2003’, Proceedings of the 22nd International System Safety Conference (ISSC). Providence, RI: Systems Safety Society, Citeseer.
Kelly, T. P. and McDermid, J. A. (1999) ‘A Systematic Approach to Safety Case Maintenance’, Computer Safety, Reliability and Security 18th International Conference, SAFECOMP’99. Toulouse, France: Springer, pp. 13-26.
Kontogiannis, T. and Malakis, S. (2012a) ‘Recursive modelling of loss of control in human and organizational processes: A systemic model for accident analysis’, Accident Analysis & Prevention, vol. 48, no. 0, pp. 303-316.
Kontogiannis, T. and Malakis, S. (2012b) ‘A systemic analysis of patterns of organizational breakdowns in accidents: A case from Helicopter Emergency Medical Service (HEMS) operations’, Reliability Engineering & System Safety, vol. 99, no. 0, pp. 193-208.
Le Coze, J. (2013) ‘New models for new times. An anti-dualist move’, Safety Science, vol. 59, no. 0, pp. 200-218.
Leonhardt, J., Macchi, L., Hollnagel, E. and Kirwan, B. (2009) A White Paper on Resilience Engineering for ATM, EUROCONTROL [Online], Available at: www.eurocontrol.int.
Leveson, N. (2011) Engineering a safer world: Systems thinking applied to safety, London: MIT Press.
Lloyd, E. and Tye, W. (1982) Systematic safety, London: Civil Aviation Authority.
Lundberg, J. (2008) ‘FRAM as a risk assessment method for nuclear fuel transportation’, 3rd IET International Conference on System Safety. 20 – 22 October. NEC, Birmingham: Institute of Engineering and Technology.
Luxhøj, J. T. (2003) Probabilistic Causal Analysis for System Safety Risk Assessments in Commercial Air Transport, Department of Industrial and Systems Engineering, Rutgers University [Online]. Available at: shemesh.larc.nasa.gov/ira03/p02-luxhoj.pdf
Luxhøj, J. T. and Williams, T. P. (1996) ‘Integrated decision support for aviation safety inspectors’, Finite Elements in Analysis and Design, vol. 23, no. 2–4, pp. 381-403.
MAA (2011a) Air Safety Information Management System User Manual, MAA [Online]. Available at: http://www.maa.mod.uk/linkedfiles/occurrence_reporting/20111005asims_user_guide_v42_finalu.pdf.
MAA (2011b) Missing Rigging Pin, asor\Lossiemouth - RAF\XV(R) Sqn\Tornado\11\9110, MAA: Air Safety Information Management System (MOD Internal System).
MAA (2012a) Gen1000 Series Regulatory Articles, 2nd ed., MAA [Online]. Available at: http://www.maa.mod.uk/linkedfiles/regulation/gen1000seriesprint.pdf.
MAA (2012b) MAA02: Military Aviation Authority Master Glossary, Issue 3, MAA [Online]. Available at: http://www.maa.mod.uk/linkedfiles/regulation/maa02.pdf.
MAA (2013a) RA 1205 – Air System Safety Cases, 2nd ed., MAA [Online]. Available at: http://www.maa.mod.uk/linkedfiles/regulation/gen1000seriesprint.pdf.
MAA (2013b) RA 1210 – Ownership and Management of Operating Risk (Risk to Life), 2nd ed., MAA [Online]. Available at: http://www.maa.mod.uk/linkedfiles/regulation/gen1000seriesprint.pdf.
Madni, A. M. and Jackson, S. (2009) ‘Towards a Conceptual Framework for Resilience Engineering’, Systems Journal, IEEE, vol. 3, no. 2, pp. 181-191.
Manson, S. M. (2001) ‘Simplifying complexity: a review of complexity theory’, Geoforum, vol. 32, no. 3, pp. 405-414.
Mason, M. (2012) Tornado Weapon System Safety Case Report Issue 1, EFIPT-ABW/06/01/13/06, MOD: Internal (RESTRICTED).
McDonald, N. (2008) ‘Challenges facing Resilience Engineering as a Theoretical and Practical Project’, Proceedings of the third symposium on resilience engineering, Juan-les-Pins, France, pp205-2010
McKenzie, K. (2012) MR.2 XV230 in the circuit at Kinloss in 2000, available at: http://www.aeroflight.co.uk/wp-content/uploads/2010/03/XV230-02.jpg (Accessed 8 October 2013).
MOD (2007) Safety Management Requirements for Defence Systems, Defence Standard 00-56, Issue 4, MOD.
MOD (2013a) Tornado Local Instruction - Equipment Risk Management, LI BS0056 Version 1.2, MOD: Internal (RESTRICTED).
Nathanael, D. and Marmaras, N. (2006) ‘The interplay between work practices and prescription: a key issue for organizational resilience’, Proceedings of the second symposium on resilience engineering, Juan-les-Pins, France pp. 229.
Oliver, D., Kelliher, T. and Keegan Jr, J. (1997) Engineering Complex Systems, McGraw-Hill.
Oxstrand, J. and Sylvander, C. (2010) ‘Resilience engineering: Fancy talk for safety culture: A Nordic perspective on resilience engineering’, Resilient Control Systems (ISRCS) 2010 3rd International Symposium on, IEEE, pp. 135.
Pasman, H. J., Knegtering, B. and Rogers, W. J. (2013) ‘A holistic approach to control process safety risks: Possible ways forward’, Reliability Engineering and System Safety, vol. 117, pp. 21-29.
RAeS (2013) ‘The Way We Do Things Around Here’ Culture in The Aviation Maintenance and Engineering Environment, Royal Aeronautical Society [Online]. Available at: http://aerosociety.com/Assets/Docs/Events/728/728Programme.pdf (Accessed 8th March).
Rasmussen, J. (1997) ‘Risk management in a dynamic society: a modelling problem’, Safety Science, vol. 27, no. 2–3, pp. 183-213.
Reason, J. (1997) Managing the Risks of Organizational Accidents, 1st ed, Farnham, Surrey: Ashgate.
Reason, J. T. and Hobbs, A. (2003) Managing maintenance error: a practical guide, Farnham, Surrey: Ashgate.
SAE (1996) Guidelines and Methods for Conducting the Safety Assessment Process on Civil Airborne Systems and Equipment, ARP4761, 1st ed. Washington: Society of Automotive Engineers.
SAE (2010) Guidelines for Development of Civil Aircraft and Systems, ARP4754 Rev A, Washington: Society of Automotive Engineers.
Saleh, J. H., Marais, K. B., Bakolas, E. and Cowlagi, R. V. (2010) ‘Highlights from the literature on accident causation and system safety: Review of major ideas, recent contributions, and challenges’, Reliability Engineering & System Safety, vol. 95, no. 11, pp. 1105-1116.
Salmon, P. M., Cornelissen, M. and Trotter, M. J. (2012) ‘Systems-based accident analysis methods: A comparison of Accimap, HFACS, and STAMP’, Safety Science, vol. 50, no. 4, pp. 1158-1170.
Saurin, T. A. and Carim Junior, G. C. (2012) ‘A framework for identifying and analyzing sources of resilience and brittleness: A case study of two air taxi carriers’, International Journal of Industrial Ergonomics, vol. 42, no. 3, pp. 312-324.
Schafer, D. (2012) A Resilience Engineering Primer, Michigan State University [Online]. Available at: https://www.msu.edu/~tariq/Resilience%20engineering%20primer.pdf (Accessed 25 October 2013).
Shirali, G. A., Mohammadfam, I. and Ebrahimipour, V. (2013) ‘A new method for quantitative assessment of resilience engineering by PCA and NT approach: A case study in a process industry’, Reliability Engineering and System Safety, vol. 119, pp. 88-94.
Singleton, C. (2009) Tornado Asset Gateway Proof of Concept - Final Report, 20090519_TAGProofOfConceptFinalReport_R, MOD: Internal (RESTRICTED).
Slater, D. (2013) SIOPS - The New HAZOPS?, Cambrensis [Online]. Available at: http://www.cambrensis.org/wp-content/uploads/2012/05/A-System-Integrity-and-operability-Study.pdf. (Accessed 4 April 2014).
Stolker, R., Karydas, D. and Rouvroye, J. (2008) ‘A comprehensive approach to assess operational resilience’, Proceedings of the third symposium on resilience engineering, Juan-les-Pins, France, pp. 28.
Stoop, J. (2013) To Certify, to Investigate or to Engineer, that is the Question, Resilience Engineering Association [Online], Available at: http://www.resilience-engineering-association.org/download/resources/symposium/symposium-2013/Stoop%20(REA%202013).%20To%20certify,%20to%20investigate%20or%20to%20engineer,%20that%20is%20the%20question.pdf. (Accessed 4 April 2014).
Sugden, G. (2011) Tornado Loss Model and Loss Model Database - December 2011 Update, BAE-WAW-RP-TOR-TGP-5209, BAE Systems: Internal (RESTRICTED).
Vugrin, E. D., Camphouse, R. C. and Sunderland, D. (2009) Quantitative Resilience Analysis Through Control Design, SAND2009-5957, Livermore, CA: Sandia National Laboratories [Online]. Available at: http://prod.sandia.gov/techlib/access-control.cgi/2009/095957.pdf (Accessed 4 April 2014).
Wilson, E. S. (2012) The Interaction Of Organisational, Human and Technology Factors On The Effectiveness Of Safety Management Systems And Value Achieved From Deploying New Technology, PhD Thesis. University of New South Wales [Online]. Available at: unsworks.unsw.edu.au/fapi/datastream/unsworks:10843/SOURCE01 (Accessed 4 April 2014).
Wilson, E. (2008) ‘Toward a model of the impact organisation, human and technology factors have on the effectiveness of safety management systems’, Journal of Achievements in Materials and Manufacturing Engineering, vol. 31, no. 2, pp. 827-836.
Woltjer, R. (2007) ‘A systemic functional resonance analysis of the Alaska Airlines flight 261 accident’, Human Factors and Economic Aspects on Safety, pp. 83.
Zarboutis, N. and Wright, P. (2006) ‘Using complexity theories to reveal emerged patterns that erode the resilience of complex systems’, Proceedings of the Second Symposium on Resilience Engineering, Juan-les-Pins, France, pp. 1999.
The following file was submitted electronically: TASM V1.xls
Appendix B – TORNADO AIRWORTHINESS MODEL VISUALISATION
The following files were submitted electronically:
TASM Visualisation Tool V1.vis (Visio file)
TASM Visualisation Tool V1.pdf (large image of Visio file showing all layers)
TASM Visualisation Tool V1 BLUEPRINT.pdf (large image of Visio file highlighting connections)
Appendix C – PARTICIPANTS BRIEFING SHEET
RESILIENCE ENGINEERING STUDY

Thank you for agreeing to take part in this postgraduate research study undertaken with Cranfield University. The aim is to improve the management of airworthiness in the RAF using Resilience Engineering principles.

What is Resilience?

Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions.

What is Resilience Engineering?

Resilience Engineering is the practice of designing or modifying resilience into a system, whether the system is a piece of technology (such as a Tornado) or a complicated organisation (RAF and its contractors). It is a move away from ‘linear’ thinking, which has produced overly simplistic models of safety such as the (in)famous ‘Swiss cheese’ or ‘bow ties’. It describes complex systems at a manageable level of detail without discarding critical connections.

Principles

1. The orders and instructions we work to never quite match the real world. Individuals and organisations must therefore adjust what they do to match current demands and resources – this is generally an approximation.
2. Some adverse events can be attributed to a breakdown or malfunctioning of components and normal system functions, but others cannot. The latter can best be understood as the result of unexpected combinations of performance variability.
3. Safety management cannot be based exclusively on hindsight (occurrence investigations), nor rely on error tabulation and the calculation of failure probabilities (risk registers). Safety management must be proactive as well as reactive.
4. Safety cannot be isolated from the core business of producing aircraft, nor vice versa. Safety is the prerequisite for productivity, and productivity is the prerequisite for safety. Safety must therefore be achieved by improvements rather than by constraining how we work with a multitude of ‘safety barriers’.

How? – Understand Combinations of Performance Variability; Functional Resonance

The study will map the whole socio-technical system that produces an airworthy Tornado. This includes everything from an AMM servicing a jet, to a design engineer producing a modification, to the fleet planning office. The whole system comprises a variety of processes, which are made up of a variety of functions (or activities). Functions are linked together by a variety of aspects – your subject matter expertise is needed to understand the different aspects of your function.

Aspects of Functions using an aircraft take-off as an example

Input – that which the function processes or transforms, or that which starts the function. Example: clearance to take-off from ATC.
Preconditions – conditions that must exist before the function can be executed. Example: aircraft on the runway.
Resources – that which the function needs or consumes to produce the output. Example: aircraft, fuel, etc.
Control – how the function is monitored or controlled; plan, programme, instructions. Example: checklist.
Time – temporal constraints affecting the function, such as finishing time or duration. Example: take-off slot.
Output – that which is the result of the function, either an entity or a state change. Example: aircraft becomes airborne.

A Function and its Aspects:
A model built using the Functional Resonance Analysis Method. Source: (Leonhardt et al., 2009)
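For readers who find a data structure easier to follow than a diagram, the six aspects described above can be sketched in Python. This is only an illustrative sketch and not part of the study's tooling: the names `FramFunction` and `find_couplings` are assumptions introduced here, and the upstream function "Issue take-off clearance" is a hypothetical example invented to show how one function's output becomes an aspect of another. The take-off values are those given in the briefing sheet.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# The five incoming aspects of a function; Output is the sixth, outgoing aspect.
INCOMING_ASPECTS = ("input", "precondition", "resource", "control", "time")


@dataclass
class FramFunction:
    """One function (activity) described by its six aspects (hypothetical class)."""
    name: str
    input: List[str] = field(default_factory=list)
    precondition: List[str] = field(default_factory=list)
    resource: List[str] = field(default_factory=list)
    control: List[str] = field(default_factory=list)
    time: List[str] = field(default_factory=list)
    output: List[str] = field(default_factory=list)


def find_couplings(functions: List[FramFunction]) -> List[Tuple[str, str, str, str]]:
    """Return (upstream, item, downstream, aspect) wherever one function's
    output is named under an incoming aspect of a different function."""
    links = []
    for upstream in functions:
        for item in upstream.output:
            for downstream in functions:
                if downstream is upstream:
                    continue
                for aspect in INCOMING_ASPECTS:
                    if item in getattr(downstream, aspect):
                        links.append((upstream.name, item, downstream.name, aspect))
    return links


# Take-off example, using the values from the briefing sheet.
take_off = FramFunction(
    name="Take-off",
    input=["Clearance to take-off from ATC"],
    precondition=["Aircraft on the runway"],
    resource=["Aircraft", "Fuel"],
    control=["Checklist"],
    time=["Take-off slot"],
    output=["Aircraft becomes airborne"],
)

# Hypothetical upstream function, invented for illustration only.
issue_clearance = FramFunction(
    name="Issue take-off clearance",
    output=["Clearance to take-off from ATC"],
)

couplings = find_couplings([issue_clearance, take_off])
for link in couplings:
    print(link)
```

Matching outputs to aspects by their exact description is the simplest possible linking rule; a real model would need agreed naming for each item so that the same output is described identically wherever it appears.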