PhD Thesis - HF in ATC

FRAMEWORK FOR THE ANALYSIS OF CONTROLLER RECOVERY FROM EQUIPMENT FAILURES IN AIR TRAFFIC CONTROL

Branka Subotic (MSc BSc)

April 2007

A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy of the University of London and for the

Diploma of Membership of Imperial College London

Centre for Transport Studies Department of Civil and Environmental Engineering

Imperial College London, United Kingdom

Declaration

At various stages during this PhD, I was involved in collaborative efforts with both

academic and industrial colleagues. In certain cases, the outputs of these collaborations

are included in this thesis to better explain and support the research presented. In

particular, during the period 2004 to 2005, colleagues from the Air Traffic Management

(ATM) Group at the Centre for Transport Studies, Imperial College London, assisted in the

questionnaire-based survey of air traffic controllers. This mainly involved the distribution of

questionnaires and collection of the responses.

Furthermore, a key element of the research presented in this thesis is the experiment

conducted at a facility owned and operated by a Civil Aviation Authority (CAA). The

experiment was facilitated by the assistance of various Air Traffic Control (ATC) Centre

staff including ATM specialists, ATC controllers, pseudo-pilots, engineers, and technicians.

Finally, EUROCONTROL staff provided a valuable contribution at various stages of this

research in terms of access to relevant publications, professional networks, and simulation

trials.

I hereby declare that besides the collaborations referred to above, I have personally

carried out the work described in this thesis:

…………………………………………………..

Branka Subotic

…………………………………………………..

Dr. Washington Yotto Ochieng

Abstract

An Air Traffic Control (ATC) system represents a set of components that act together to

achieve a safe and efficient flow of traffic in any given airspace. The elements of this

system are human operators, equipment, and procedures, along with all the interactions

between them. Failure of equipment, as one component of an ATC system, and its

interaction with human operators (i.e. air traffic controllers) is the main focus of the

research presented in this thesis. Thus, the thesis focuses on the human recovery process

triggered by failure of equipment that supports air traffic controllers in the provision of air

traffic services in a dedicated airspace. A detailed understanding of the controller recovery

process has the potential to significantly contribute to safety and operational efficiency in

the current and future ATC environment. Currently, there is a very limited understanding of

the factors that influence the recovery process, particularly with respect to equipment

failures in ATC. This thesis builds on existing relevant research in other industries and

uses targeted experiments and mathematical modelling to develop a functional

relationship between recovery and its influencing factors.

The research presented in this thesis addresses two areas, namely equipment failures

in ATC and controller recovery. The first investigates the characteristics of the ATC

equipment failures from past research and derives the associated target level of safety.

Linking the target level of safety with available operational failure reports establishes a

means to validate the realism and operational significance of the equipment failure

characteristics. A subset of these characteristics relevant to ATC operations is further

used to develop a novel qualitative equipment failure impact assessment tool. This tool

enables the identification of equipment failures that are most severe to ATC operations

and thus may be most challenging to controller performance.

Having identified the relevant equipment failure types and their characteristics, the thesis

carries out a critical review of the associated issues regarding the process of controller

recovery. A critical element of this is the review of past human reliability research and its

relationship to controller recovery from equipment failures in ATC. The findings from this

are augmented by questionnaire survey results based on responses of 134 air traffic

controllers from 34 countries. Both the past research and the questionnaire survey results

are used to highlight the importance of the context in which controller recovery

performance takes place and to define the recovery context through a set of 20 candidate

contextual factors or Recovery Influencing Factors (RIFs).

The thesis then uses the candidate RIFs to develop a novel approach for the quantitative

assessment of the recovery context through the concept of a recovery context indicator. This

approach and its operational benefits are further validated by an experiment conducted in

a training facility of an ATC Centre with the participation of 30 operational air traffic

controllers. In addition to the verification of the generic methodology for the assessment of

the recovery context, the experimental data are used to analyse controller recovery

performance and investigate the outcome of the recovery process. The findings obtained

from the experimental investigation are in line with those obtained from past research and

the ATC operational environment.

Acknowledgements

Having started my research initially at the EUROCONTROL Experimental Centre (EEC) in

Bretigny sur Orge and then at Imperial College London, it is understandable that naming

all those people who have contributed to this work is quite a hard task. However, I will try

anyway, and if some names are not listed, my gratitude to them is no less than to those listed

below.

For help with the funding of my studies, I would like to thank the following organisations:

• EUROCONTROL Experimental Centre (EEC) in Bretigny sur Orge, France for the

award of a graduate internship and a further three-year research studentship;

• Universities UK for the Overseas Research Scheme (ORS) award for three

consecutive years; and

• the Centre for Transport Studies, Department of Civil and Environmental

Engineering, Imperial College London for the contribution to my tuition fee and a

three-year research bursary.

This PhD research would not have been possible without Christian Push and Dirk

Schaefer who invited me initially to join the EUROCONTROL Human Factors group and to

start developing a research project satisfying both the needs of the EEC as well as my

own interests. Once started, this collaboration proved to be highly supportive in both

technical and financial terms. As a EUROCONTROL PhD student I had the privilege of

unlimited access to many aviation experts working “in house”: at the EEC, Headquarters

(Belgium), and the Maastricht Upper Area Control (UAC) Centre (Netherlands). Among

these were Nigel Makings, Catherine Gandolfi, Eric Perrin, Deirdere Bonini, Rachael

Gordon, Andrew Harvey, and the entire Gate-to-Gate (G2G) team and controllers involved

in simulation A and B, especially Diarmuid Houlihan ‘Motto’. I thank them all for the fruitful

collaboration. My special gratitude goes to Barry Kirwan and Oliver Straeter whose

technical assistance and unlimited support were crucial to embarking upon the field of

human reliability, completely unknown to me at the beginning of this research. Their

assistance and interest in my research opened many doors and assured the highest

quality of information and professional contacts.

At Imperial College there are many colleagues and research students that offered their

help at various stages and aspects of my work. Among them are Jackie Sime, William

Knottenbelt, Dimitri Panagiotakopoulos, Marie-Dominique Dupuy, Umar Bhatti, Victoria

Williams, and Wolfgang Shuster. However, my greatest gratitude goes to Arnab Majumdar

and to my supervisor, Washington Y. Ochieng. They had a critical role in the support,

supervision, and achievement of excellence in my research. Thanks to their

understanding, I attended various technical meetings, seminars, conferences, courses,

and simulation trials. These proved to be a significant direct and indirect contribution to the

quality of the research presented in this thesis.

One of the critical parts of the research presented in this thesis would not have been feasible

without the technical support of the Irish Aviation Authority staff, especially Nick Lowth,

Bernard Mackessy, and Garrett MacNamara. However, my special gratitude goes to Alan

Byrne for making the impossible truly possible and allowing me to successfully complete a

key part of this research.

There are many other people that have helped in various ways. I would like to thank Yvette

Dalle-Mule, Veronique Begault, and Sonja Straussberger from EUROCONTROL EEC.

Furthermore, I would like to thank Rajkumar Pant from the Indian Institute of Technology,

Isa Alkalaj and Marek Bekier from Skyguide, Martin Richards and Vic Burgess from UK

NATS, Christopher Adams from Maastricht UAC, Bob Phillips from CASA Australia, Peter

Nalder from New Zealand Civil Aviation Authority (CAA), Jos Kuijper and Randal de Garis

from EUROCONTROL, Sarah Doherty and Joji Waites from the UK CAA, and Keshava

Sharma from the Airports Authority of India.

I want to thank my friend Tamara Pejovic for all the support that she gave me during the

years I have been working on this thesis. Last but not least, I want to express my deepest

gratitude to my brother and my mother who were always the core support in all the

journeys that I have embarked upon. Hence, I am dedicating this thesis to them.

Table of Contents

DECLARATION ii
ABSTRACT iii
ACKNOWLEDGEMENTS v
TABLE OF CONTENTS vii
LIST OF FIGURES xiv
LIST OF TABLES xvii
LIST OF ABBREVIATIONS xix

1 INTRODUCTION 1
1.1 Background to the problem 1
1.2 Research objectives 4
1.3 Outline of the thesis 5

2 FUNDAMENTALS OF AIR TRAFFIC MANAGEMENT AND CONTROL 8
2.1 Air Traffic Management 8
2.2 Air Traffic Control 10
2.2.1 Area Control service 11
2.2.2 Approach Control service 12
2.2.3 Aerodrome control service 12
2.3 Overall Air Traffic Control system architecture 13
2.3.1 Air Traffic Control functionalities 15
2.3.1.1 Communication function 15
2.3.1.2 Navigation function 18
2.3.1.2.1 Approach and landing navigation 19
2.3.1.2.2 Area navigation 20
2.3.1.2.3 Systems for control and monitoring of ground-based airport facilities 22
2.3.1.3 Surveillance function 22
2.3.1.3.1 Radar systems 23
2.3.1.3.2 Radar and auxiliary display 24
2.3.1.3.3 Terminal and ground surveillance 24
2.3.1.4 Data processing and distribution function 25
2.3.1.5 Supporting function 28
2.3.1.6 Safety Nets 29
2.3.1.7 Power supply 30
2.3.1.8 Pointing and input devices 31
2.3.1.9 System control and monitoring function 31
2.4 Characteristics of the generic Air Traffic Control Centre 32
2.5 The future of Air Traffic Control 34
2.5.1 Challenges of automation 34
2.5.2 Human-centred vs. technology-centred automation 36
2.5.3 The future of air navigation service 37
2.5.4 Impact of future ATM/ATC on controller recovery from equipment failures 38
2.6 Summary 39

3 PRELIMINARY ASSESSMENT OF EQUIPMENT FAILURES IN AIR TRAFFIC CONTROL 41
3.1 Definition of equipment failure 42
3.2 Definition of a hazard 44
3.3 Supporting data: operational failure reports 45
3.3.1 Reporting and data collection 46
3.3.2 Data pre-processing problems 47
3.3.3 Available operational failure reports 49
3.4 Methodology to assess the relevance of supporting data 51
3.4.1 The accident to incident ratio 51
3.4.2 Units of measurement 53
3.4.3 The acceptable risk or target level of safety (TLS) 55
3.4.3.1 Existing standards 55
3.4.3.1.1 Joint Aviation Authority 56
3.4.3.1.2 UK Civil Aviation Authority 58
3.4.3.1.3 International Civil Aviation Organisation 58
3.4.3.1.4 Summary of the various TLS analyses 60
3.4.4 Target level of safety and Air Traffic Control risk budgeting 62
3.4.5 Target level of safety and Air Traffic Control equipment risk budgeting 63
3.5 Preliminary analysis and validation of operational failure reports 65
3.6 Summary 67

4 EQUIPMENT FAILURES AND TECHNICAL DEFENCES IN AIR TRAFFIC CONTROL 69
4.1 Equipment failure characteristics 69
4.1.1 ATC functionality affected 70
4.1.2 Complexity of failure type 71
4.1.3 Time course of failure development 71
4.1.4 Duration of failure 72
4.1.5 Potential causes of equipment failures 72
4.2 Consequences of equipment failure 73
4.2.1 Impact on air traffic controller 73
4.2.2 Impact on operations room 73
4.2.3 Impact on ATC operations 74
4.2.4 Impact on ATM operations 79
4.3 Definition of technical defences (technical recovery) 80
4.3.1 Defences for recovering from failure (safety devices) 82
4.3.2 Defences for transmitting information regarding the failure (warning devices) 83
4.4 Analysis of operational failure reports 85
4.4.1 Data analysis methodology 85
4.4.2 Rate of equipment failures 89
4.4.3 Type of ATC functionality and equipment affected 91
4.4.4 Complexity of failure type 95
4.4.5 Severity of equipment failures 96
4.4.6 Duration of equipment failures 98
4.4.7 Additional statistical tests 100
4.5 Qualitative equipment failure impact assessment tool 101
4.6 Summary 107

5 AIR TRAFFIC CONTROLLER RECOVERY 109
5.1 Human recovery in air traffic control 109
5.1.1 Recovery by air traffic controllers 110
5.1.2 Recovery by system control and monitoring engineers 110
5.2 Phases of the controller recovery process 111
5.2.1 Detection 113
5.2.2 Diagnosis 116
5.2.3 Correction 117
5.3 Outcome of the recovery process 119
5.4 Models of human recovery 121
5.4.1 Model by Kanse 122
5.4.2 RAFT Tool 123
5.4.3 Model by Wickens et al. 124
5.5 Procedures for handling ATC equipment failures 126
5.5.1 Existing regulations 127
5.5.1.1 International regulation 127
5.5.1.2 European and national regulation 128
5.5.1.3 Air navigational service provider regulation 128
5.5.2 Main principles on recovery procedures in ATC 130
5.6 Training for handling ATC equipment failures 131
5.6.1 Existing regulations 131
5.6.1.1 International regulation 131
5.6.1.2 European and national regulation 132
5.6.1.2.1 UK Civil Aviation Authority regulation 132
5.6.1.3 Air navigational service provider regulation 133
5.6.2 Areas of concern related to recovery training 133
5.7 Definition of controller recovery performance in this thesis 135
5.7.1 Recovery context 135
5.7.2 Recovery effectiveness 136
5.7.3 Recovery duration 136
5.8 Summary 137

6 QUESTIONNAIRE SURVEY 139
6.1 Objectives of the questionnaire survey 140
6.2 Sampling 141
6.3 Survey methodology 143
6.4 Design of the questionnaire 144
6.5 Pilot survey 146
6.6 Full survey 147
6.6.1 Face-to-face interviews 147
6.6.2 Self-completion survey 147
6.6.3 Potential sources of errors 148
6.7 Methodology for the questionnaire survey data analysis 149
6.7.1 Data pre-processing for analysis 150
6.7.2 Characteristics of the sample 151
6.7.2.1 Sampling per ATC Centre 154
6.7.2.2 Sampling of air traffic controllers 154
6.7.3 High-level analyses 155
6.7.3.1 Experience with equipment failures (Q1) 156
6.7.3.2 Factors that influence the controller recovery performance (Q2) 156
6.7.3.3 The most unreliable ATC systems/tools (Q3) 158
6.7.3.4 Organised exchange of information on equipment failures (Q4) 163
6.7.3.5 Status and quality of recovery procedures (Q5) 164
6.7.3.5.1 Other findings regarding the recovery procedures 167
6.7.3.6 Status and quality of training for recovery (Q6) 168
6.7.3.6.1 Other findings on training for recovery 170
6.7.3.7 Other findings on recovery performance 171
6.7.4 Interaction analyses 171
6.8 Summary 175

7 METHODOLOGY FOR A SELECTION OF RELEVANT AIR TRAFFIC CONTROLLER RECOVERY INFLUENCING FACTORS 178
7.1 Relevance of the recovery context 178
7.1.1 Example of the recovery context 180
7.2 Methodology to extract the candidate set of contextual factors 181
7.2.1 Human Reliability Assessment techniques 183
7.2.1.1 Human Error in ATM (HERA) 183
7.2.1.2 Technique for the Retrospective and Predictive Analysis of Cognitive Errors in ATC (TRACEr) 184
7.2.1.3 Recovery from Automation Failure (RAFT) Tool 185
7.2.1.4 Recovery from failures: understanding the positive role of human operators during incidents 186
7.2.1.5 Computerised Operator Reliability and Error Database (CORE-DATA) 187
7.2.1.6 Technique for Human Error Rate Prediction (THERP) 188
7.2.1.7 Human Error Assessment and Reduction Technique (HEART) 190
7.2.1.8 The Contextual Control Model (COCOM) 191
7.2.1.9 Cognitive Reliability and Error Analysis Method (CREAM) 192
7.2.1.10 Human Reliability Management System (HRMS) 193
7.2.1.11 A Technique for Human Event Analysis (ATHEANA) 194
7.2.1.12 Connectionism Assessment of Human Reliability (CAHR) 195
7.2.1.13 Nuclear Action Reliability Assessment (NARA) 196
7.2.1.14 Human Performance DataBase (HPDB) 197
7.2.1.15 Summary of the findings 198
7.2.2 Augmentation with equipment-failure related factors 200
7.2.3 Augmentation with dynamic situational factors 200
7.2.4 Further subdivision of the identified RIFs 201
7.3 Definition of qualitative descriptors 202
7.4 Summary 204

8 QUANTITATIVE ASSESSMENT OF THE RECOVERY CONTEXT 206
8.1 Lessons learnt from past research 206
8.1.1 Application of the CREAM technique 207
8.1.2 Connectionism Assessment of Human Reliability (CAHR) 208
8.2 Framework for the methodology for a quantitative assessment of recovery context 209
8.3 Probabilistic assessment of RIFs (Step 2) 211
8.3.1 Sources of information 212
8.3.1.1 Operational failure reports 212
8.3.1.2 Questionnaire survey 213
8.3.1.3 Input by ATM Specialists 213
8.3.1.4 Past literature 216
8.3.1.5 Aggregation of data 216
8.3.2 Summary 217
8.4 Interactions between Recovery Influencing Factors (Step 3) 218
8.4.1 Identification of RIF interactions 218
8.4.2 Validation of RIF interactions 221
8.4.2.1 CREAM 221
8.4.2.2 CAHR 221
8.4.2.3 Validation by ATM specialists 222
8.4.2.4 Validation summary 223
8.4.3 Quantification of RIFs interactions 223
8.5 Methodology for the determination of the cut-off points (Step 4) 227
8.6 Specific effects of RIFs on controller recovery performance (Step 5) 231
8.7 Calculation of the recovery context indicator (Step 6) 232
8.7.1 Re-calculation of RIF probabilities 232
8.7.2 Distribution of the recovery context indicator 234
8.7.3 Sensitivity analysis 236
8.7.4 Optimal solutions 237
8.8 Summary 238

9 EXPERIMENTAL INVESTIGATION OF THE AIR TRAFFIC CONTROLLER RECOVERY PERFORMANCE 240
9.1 High-level design of the experimental process 241
9.2 Rationale for the experiment 242
9.3 Assessment of the available resources 242
9.4 Planning for the experiment 243
9.5 Design of the experiment 244
9.6 Selection of the equipment failure to be simulated 246
9.7 Pilot study: lessons learnt 249
9.7.1 Summary of the findings from the pilot study 252
9.8 Experimental set up 253
9.8.1 Airspace characteristics 256
9.8.2 Traffic characteristics 257
9.8.3 Equipment failure characteristics 257
9.9 Experimental variables 259
9.9.1 Independent Variables 260
9.9.1.1 Recovery Influencing Factors (RIFs) 260
9.9.1.2 Required recovery steps 263
9.9.2 Dependent Variables 264
9.9.2.1 Recovery effectiveness 264
9.9.2.2 Recovery duration 266
9.9.3 Extraneous Variables 267
9.10 Potential limitations 268
9.11 Summary 268

10 ANALYSIS OF EXPERIMENTAL RESULTS 270
10.1 Overall framework 270
10.2 Participants 271
10.2.1 Age and operational experience 272
10.2.2 Ratings 272
10.3 Assessment of controller recovery performance 274
10.3.1 Recovery context 274
10.3.1.1 Assessment of relevant RIFs 274
10.3.1.2 Probabilities of each RIF and its corresponding level 275
10.3.1.3 Interactions between RIFs 276
10.3.1.4 Recovery context indicator (Ic) 276
10.3.1.5 Optimal solutions 280
10.3.1.5.1 Impact of enhancing ‘recovery procedure’ on recovery context 281
10.3.2 Required recovery steps 283
10.3.3 Recovery effectiveness 285
10.3.4 Recovery duration 286
10.3.5 Outcome of the recovery process 289
10.3.6 Interactions 291
10.3.7 Other findings 292
10.3.7.1 The recovery phases 292
10.3.7.1.1 Detection 292
10.3.7.1.2 Diagnosis 293
10.3.7.1.3 Correction 293
10.3.7.2 Observed behaviour and attitude 295
10.3.7.3 Additional findings 296
10.4 Summary 299

11 CONCLUSIONS 301
11.1 Revisiting the research objectives 301
11.2 Conclusions 301
11.2.1 Literature review 301
11.2.2 Equipment failure types and their characteristics 302
11.2.3 Controller recovery performance, recovery context, and influencing factors 303
11.2.4 Framework for the analysis of controller recovery 305
11.3 Future work 306
11.4 Publications relating to this work 307
11.4.1 Publication format: journal – accepted subject to revision 308
11.4.2 Publication format: journal – published 308
11.4.3 Publication format: conference proceedings – published 308

12 LIST OF REFERENCES 309

APPENDICES 323
Appendix I The cost of delays induced by equipment failures 324
Appendix II Interviews with ATM staff 326
Appendix III Checklist for the Equipment Failure Scenarios in a specific European ATC Centre - An Aide-Memoire framework 329
Appendix IV The questionnaire design 341
Appendix V Example of one questionnaire response 348
Appendix VI Results extracted from question 5 of the questionnaire survey 354
Appendix VII Overview of contextual factors 359
Appendix VIII Probabilities for 20 Recovery Influencing Factors (RIFs) 361
Appendix IX Questions for the ATM Specialist 375
Appendix X Overview of RIFs, their corresponding levels, and designated probabilities 378
Appendix XI Validation of the RIFs interaction matrix 381
Appendix XII Distribution of 20 Recovery Influencing Factors (RIFs) 383
Appendix XIII Experimental material 385
Appendix XIV Overview of RIFs, their corresponding levels, and probabilities determined in the experimental investigation 402
Appendix XV Distribution of the recovery context indicator captured in the experiment 404

List of Figures

Figure 1-1 Overview of the thesis 7
Figure 2-1 Air transport system (from Subotic et al., 2005) 9
Figure 2-2 Flight profile (adapted from ICAO, 2001b) 10
Figure 2-3 ATM and ATC system components (adapted from ICAO, 2001a) 14
Figure 2-4 Communication function 16
Figure 2-5 Navigational function 19
Figure 2-6 Surveillance function 23
Figure 2-7 Data processing and distribution function 26
Figure 2-8 Supporting function 29
Figure 2-9 System monitoring and control function 31
Figure 3-1 Phases of an equipment failure occurrence 41
Figure 3-2 Different definitions 43
Figure 3-3 Reporting system 46
Figure 3-4 “Bathtub” model of reliability for electronic components (Leveson, 1995) 50
Figure 3-5 Aviation TLS and risk budgeting 64
Figure 4-1 Safety through design (adapted from Christensen and Manuele, 1999) 81
Figure 4-2 Technical and human recovery 82
Figure 4-3 Operational failure reports analyses 87
Figure 4-4 Total number of equipment failures per flight hours flown in each year for countries A, B, and C 90
Figure 4-5 Total number of equipment failures per flight hours flown in each year for country D (year 2000 incomplete) 90
Figure 4-6 Most affected ATC functionality (Country A) 91
Figure 4-7 Most affected ATC functionality (Country B) 92
Figure 4-8 Most affected ATC functionality (Country C) 92
Figure 4-9 Most affected ATC functionality (Country D) 93
Figure 4-10 Distribution of equipment failures according to their severity 96
Figure 4-11 Distribution of major equipment failures according to ATC functionality 97
Figure 4-12 Distribution of the failure duration according to four distinct categories 99
Figure 4-13 Qualitative equipment failure impact assessment tool 105
Figure 5-1 Analysis of outcome phase (adapted from EUROCONTROL, 2004e) 120
Figure 5-2 Recovery process phase model (Kanse, 2004) 123
Figure 5-3 The Recovery from Automation Failure Tool (RAFT) Framework (EUROCONTROL, 2004e) 124
Figure 5-4 Model of failure recovery in air traffic control. Where two nodes are connected by an arrow, signs (+, -, 0) indicate the direction of effect on the variable depicted in the right node, caused by an increase in the variable depicted in the left node (Wickens et al., 1998) 125
Figure 6-1 The flow diagram of organising a survey 140
Figure 6-2 Distribution of world air traffic per region for the year 2003 and 2023 (adapted from Airbus, 2004) 142
Figure 6-3 One-page example of the questionnaire 146
Figure 6-4 The flow chart of questionnaire survey analyses 150
Figure 6-5 Distribution of questionnaire responses per region 153
Figure 6-6 Distribution of operational experience 155
Figure 6-7 Distribution of air traffic controllers’ ratings 155
Figure 6-8 Controllers’ reliance on written procedures throughout the recovery process 157
Figure 6-9 Controllers’ reliance on situation-specific problem solving throughout the recovery process 157
Figure 6-10 Controllers’ reliance on past experience throughout the recovery process 158
Figure 6-11 Distribution of affected ATC functionalities as reported in the questionnaire survey 159
Figure 7-1 Methodology to extract a candidate set of RIFs 182
Figure 8-1 Framework for the quantitative assessment of the recovery context 210
Figure 8-2 Distribution of RIF5 levels amongst identified recovery contexts without interactions 226
Figure 8-3 Distribution of RIF5 levels amongst identified recovery contexts with interactions 226
Figure 8-4 Distribution of RIF1 levels amongst identified recovery contexts with interactions 227
Figure 8-5 Distribution of RIF20 levels amongst identified recovery contexts with interactions 227
Figure 8-6 Distribution fitting for the three cut-off points on the example of RIF5 Level 1 229
Figure 8-7 Cubic polynomial function f(x) fitted for the RIF5 to determine its minimum 230
Figure 8-8 Distribution of the recovery context indicator 235
Figure 9-1 The flow diagram of experimental investigation 241
Figure 9-2 Timeline of the experiment 254
Figure 9-3 Room setup 255
Figure 9-4 The visual representation of equipment failure on CWP: a) before the failure, b) after the failure 258
Figure 10-1 Framework for the analysis of experimental results 271
Figure 10-2 Distribution of operational experience 272
Figure 10-3 Distribution of controllers’ ratings 273
Figure 10-4 Distribution of the recovery context indicator in the experiment 277
Figure 10-5 Distribution of the recovery context indicator in the experiment with an increased value of the coefficient of interaction 279
Figure 10-6 Distribution of the recovery context indicator of 30 controllers 280
Figure 10-7 Recovery steps performed by each participant 283
Figure 10-8 Distribution of required recovery steps (S1 to S17) 284
Figure 10-9 Distribution of recovery effectiveness per category 286
Figure 10-10 Distribution of recovery duration 287
Figure 10-11 Distribution of the recovery outcome 290
Figure 10-12 Recovery phases, their corresponding influencing factors, and required recovery steps 295

List of Tables

Table 3-1

Table 3-2 Table 3-3 Table 4-1

Table 4-2 Table 4-3 Table 4-4 Table 4-5

Table 4-6 Table 4-7 Table 4-8 Table 4 9 Table 4-10 Table 4-11

Table 4-12 Table 4-13


Table 4-16

Table 4-17


Table 6-3


Table 7-2

Table 7-3

Summary of available data, number of reports, and equipment failure 49 incidents per country Summary of various analyses on aviation TLS 61 Analysis of operational failure reports and results 66 Examples of equipment failures related to different ATC system 70 functionalities (as defined in Chapter 2) UK NATS severity rating (from NATS, 2002) 75 Country C’s severity rating as defined by its CAA 76 Country D severity rating as defined by the particular ATC Centre 76 Severity rating defined in this research and mapped with available 77 sources Most affected ATC equipment (Country A) 91 Most affected ATC equipment (Country B) 92 Most affected ATC equipment (Country C) 93 Most affected ATC equipment (Country D) 94 Summary of five ATC equipment types most affected by failures 94 Percentage of the multiple failure occurrences reported in the 95 available datasets Summary of five most affected equipment types from four datasets 98 Distribution of major failures lasting up to 15 minutes per ATC 99 equipment affected Statistical tests and results obtained 100 Main findings regarding interaction between ATC functionality and 101 severity Review of equipment failure characteristics with regard to their 101 impact on ATC operations Detailed overview of the primary and the secondary group of ATC 103 functionalities Phases of the recovery process identified in past research 112 Summary of relevant models of the human recovery process 126 Summary of the questionnaire survey sample 151 Mapping between most unreliable ATC functionalities and existing 160 recovery procedures for sampled worldwide countries Existence of recovery procedures, recovery training, and recurrent 165 training as reported in the questionnaire survey Interaction matrix 172 Statistical tests and results obtained 173 Factors influencing recovery from failures (from Kanse and van der 186 Schaaf, 2000) Factors influencing human actions in THERP (cited in Straeter, 189 2000) Review of Human Reliability Assessment 
(HRA) techniques and 198

xvii

Table 7-4 Table 7-5


Table 8-5

Table 8-6

Table 8-7 Table 8-8

Table 8-9

Table 8-10 Table 8-11 Table 8-12 Table 8-13 Table 9-1 Table 9-2

Table 9-3 Table 9-4

Table 9-5 Table 9-6

Table 9-7 Table 9-8 Table 9-9 Table 9-10 Table 10-1 Table 10-2



Table 10-8



relevant findings Recovery Influencing Factors 201 Relevant recovery influencing factors and their corresponding 203 qualitative descriptors Overview of CREAM and CAHR differences 208 Distribution of probabilistic RIF ratings per source 212 ATM specialists involved in the assessment of RIFs 214 Overview of the sources of information used to determine RIF 217 probabilities Example of a potential recovery context represented as a 20-digit 218 array Interaction matrix: (1) validation by CREAM, (2) validation by CAHR, 220 (3) validation by ATM specialists; and (x) not validated interactions Mapping between RIFs and CAHR contextual factors 222 Recovery context (as presented in Table 8-5) after the incorporation 225 of RIF interactions Descriptive statistics for the three cut-off points on the example of 229 RIF5 Level 1 Local minimums of polynomial functions 230 Cut-off points between the levels for all RIFs 230 Probabilities for the RIF5 and each of its levels (see Appendix VII) 232 Sensitivity analysis 237 Training, pilot study, and experiment sessions 244 Overview of the potential equipment failures to be simulated and 247 their inclusion in the pilot study Equipment failures used in the pilot study 249 The mapping between exercise characteristics and the controllers 257 observations Equipment failure in the experimental study 258 Availability of functions in the reduced flight data processing mode 259

Overview of independent and dependent variables 259 Overview of independent and extraneous variables 261 Overview and description of required recovery steps 263 Recovery process and its three main tasks 265 Characteristics of a sample of controllers participating in experiment 273 Verification of RIFs probabilities from a ‘generic’ approach (Chapter 275 8) and the experiment Summary of RIFs defined through a single corresponding level 277 Verification of the distribution of the recovery context indicator 278 obtained from a ‘generic’ approach (Chapter 8) and the experiment A review of RIFs with the potential for recovery enhancement 281 A review of the proposed recovery solutions 282 Percentage of performed recovery steps in three experimental 285 sessions Comparison of recovery durations between three experimental 288 sessions Statistical tests and results 289 The outcome of the recovery process matrix (S stands for 290 successful, T for tolerable, and U for unsuccessful recovery) Statistical tests and results 291 Summary of additional findings 299


List of Abbreviations

ACAS Airborne Collision Avoidance System
ACC Area Control Centre
ADREP Accident/Incident Reporting
ADS Automatic Dependent Surveillance
ADS-B Automatic Dependent Surveillance Broadcast
ADS-C Automatic Dependent Surveillance Contract
AFTN Aeronautical Fixed Telecommunication Network
A/G Air-Ground communication
AGDP Air Ground Data Processor
AGL Aeronautical Ground Lighting
AIAA American Institute of Aeronautics and Astronautics
AIS Aeronautical Information Service
AMAN Arrival Manager
ANSP Air Navigation Service Provider
APP Approach Control Office
APR Automatic Position Reporting
APW Area Proximity Warning
ARO Air traffic services Reporting Office
ARTCC Air Route Traffic Control Centre
ASAS Airborne Surveillance and Separation Assurance
ASM Airspace Management
ASMT ATM Safety Monitoring Tool
ASMT Automatic Safety Monitoring Tool
ASTERIX All Purpose STructured Eurocontrol Radar Information Exchange
ATC Air Traffic Control
ATCT Air Traffic Control Tower
ATFM Air Traffic Flow Management
ATHEANA A Technique for Human Event Analysis
ATIS Aeronautical Terminal Information Service
ATM Air Traffic Management
ATS Air Traffic Service
AWOP All-Weather Operations Panel
BBN Bayesian Belief Network
BEST Beginning to End Skills Trainer
BEVOR German special occurrences database
CAA Civil Aviation Authority
CAHR Connectionism Assessment of Human Reliability


CATIS Computerised Automatic Terminal Information Service
CC Contextual Condition
CLAM Cleared Level Adherence Monitoring
CEATS Central European Air Traffic Services
CFMU Central Flow Management Unit
CMS Control and Monitoring System
CNS Communication Navigation Surveillance
COCOM Contextual Control Model
CORE-DATA Computerised Operator Reliability and Error Database
CPC Common Performance Condition
CPDLC Controller Pilot Data Link Communication
CPM Common Performance Modes
CRDS CEATS Research, Development and Simulation
CREAM Cognitive Reliability and Error Analysis Method
CS Commercial Service
CWP Controller Working Position
DARC Direct Access Radar Channel
DMAN Departure Manager
DME Distance Measuring Equipment
EASA European Aviation Safety Agency
ECAC European Civil Aviation Conference
ECSS European Cooperation for Space Standardisation
EGNOS European Geostationary Navigation Overlay Service
EOC Errors Of Commission
EOO Errors of Omission
EPC Error Producing Condition
ESA European Space Agency
ESSAR EUROCONTROL SAfety Regulatory Requirements
ET Event Tree
EU European Union
EUROCONTROL European Organization for Safety of Air Navigation
FAA Federal Aviation Administration
FANS Future Navigation System
FDPD Flight Data Processing and Distribution
FDPS Flight Data Processing System
FIR Flight Information Region
FIS Flight Information Service
FL Flight Level
FMEA Failure Mode and Effect Analysis
FMECA Failure Modes, Effects, and Criticality Analysis
FMS Flight Management System
FPP Flight Plan Processing
FPS Flight Progress Strips
FT Fault Tree
G2G Gate to Gate
G/G Ground-Ground communication
GLONASS Global Orbiting Navigation Satellite System
GNSS Global Navigation Satellite Systems
GPS Global Positioning System
HEART Human Error Assessment and Reduction Technique
HEIDI Harmonisation of European Incident Definition Initiative


HEP Human Error Probability
HFACS Human Factors Analysis and Classification System
HERA Human Error in ATM Project
HF High Frequency
HF DL High Frequency Data Link
HMI Human Machine Interface
HPDB Human Performance DataBase
HRA Human Reliability Assessment
HRMS Human Reliability Management System
IANS Institute of Air Navigation Services
IC Intercom
Ic recovery Context Indicator
ICAO International Civil Aviation Organization
IEC International Electrotechnical Commission
IEEE Institute of Electrical and Electronics Engineers
IFR Instrument Flight Rules
ILS Instrument Landing System
IMC Instrument Meteorological Conditions
IMC Industry Management Committee
INS Inertial Navigation Systems
IP Interphone
IRS Incident Reporting System
ISO International Organisation for Standardisation
JAA Joint Aviation Authority
JAR Joint Aviation Regulations
JHEDI Justification of Human Error Data Information
M Mean
MAESTRO Means to Aid Expedition and Sequencing of Traffic with Research and Optimisation
MANTAS Maastricht ATC New Tools And Systems
MATS Manual of Air Traffic Services
MDT Mean Down Time
MET Meteorological
METAR Meteorological Aerodrome Report
Mil Military
MLS Microwave Landing System
MMI Man Machine Interface
MMS Man Machine System
MONA MONitoring Aids
MORS Mandatory Occurrence Reporting Scheme
MRP Multi Radar Processing
MSAW Minimum Safe Altitude Warning
MSL Mean Sea Level
MTBF Mean Time Between Failure
MTBM Mean Time Between Maintenance
MTCD Medium Term Conflict Detection
MTTR Mean Time To Repair
MUAC Maastricht Upper Area Control Centre
NATSPG North Atlantic Systems Planning Group
MTOW Maximum Take Off Weight


NARA Nuclear Action Reliability Assessment
NAIPS National Aeronautical Information Processing System
NAS National Aviation System
NASA National Aeronautics and Space Administration
NATS National Air Traffic Service
NUCLARR Nuclear Computerized Library for Assessing Reactor Reliability
NDB Non-Directional Beacon
NLR National Aerospace Laboratory
NOTAM Notice to Airmen
NTL National Transportation Library
NTSB National Transportation Safety Board
OJT On-the-Job-Training
OLDI On-line Data Interchange
OS Open Service
PABX Private Automatic Branch Exchange
PAR Precision Approach Radar
PARM Parallel Approach Runway Monitor
PPS Precise Positioning Service
PRA Probabilistic Risk Assessment
PRNAV Precision aRea NAVigation
PRS Public Regulated Service
Proc Procedural control
PRS Primary Radar Service
PSA Probabilistic Safety Assessment
PSF Performance Shaping Factor
PSR Primary Surveillance Radar
PTT Press To Talk
QRA Quantitative Risk Assessment
RAFT Recovery from Automation Failure Tool
RAM Route Adherence Monitoring
RCP Required Communication Performance
RDP Radar Data Processing
RDPS Radar Data Processing System
RDR Radar
RGCSP Review of the General Concept of Separation Panel
RIF Recovery Influencing Factor
RIMCAS Runway Incursion Monitoring and Conflict Alert System
RNP Required Navigational Performance
RSP Required Surveillance Performance
RT Radio Telephony
RTCA Radio Technical Commission for Aeronautics
RVSM Reduced Vertical Separation Minima
RVR Runway Visual Range
RWY Runway
SAR Special Administrative Region
SAR Search And Rescue
SAS Situational Awareness for Safety
SATCOM SATellite COMmunication
SHAPE Solutions for Human Automation Partnership in European ATM
SBAS Satellite-Based Augmentation Systems
SBJ Supersonic Business Jet


SD Standard Deviation
SE Standard Error
SEP Safety and Emergency Procedures
SES Single European Sky
SID Standard Instrument Departure
SME Subject Matter Expert
SMC Surface Movement Control
SMR Surface Movement Radar
SNET Safety Nets
SoL Safety-of-Life
SOR Stimulus-Organism-Response
SPS Standard Positioning Service
SRG Safety Regulatory Group
SRK Skill Rule Knowledge
SRP Single Radar Processing
SRU Safety Regulatory Unit
SSR Secondary Surveillance Radar
STAR Standard Terminal Arrival Route
STCA Short Term Conflict Alert
SUA Special Use Airspace
SYSCO System Supported COordination
TACAN TACtical Air Navigation
THERP Technique for Human Error Rate Prediction
TAR Terminal Approach Radar
TCAS Traffic Alert and Collision Avoidance System
TID Touch Input Device
TRACON Terminal Radar Approach CONtrol
TIP Touch Input Panels
TLS Target Level of Safety
TRACEr Technique for the Retrospective and Predictive Analysis of Cognitive Errors in ATC
TRUCE TRaining for Unusual Circumstances and Emergencies
TRM Team Resource Management
TTA Time To Alert
TWR Aerodrome Control Tower
TWY Taxiway
UAV Unmanned Aerial Vehicles
UHF Ultra High Frequency
UPS Uninterruptible Power Supply
US United States
UTC Coordinated Universal Time
VDL Very high frequency Data Link
VFR Visual Flight Rules
VHF Very High Frequency
VMC Visual Meteorological Conditions
VOR VHF Omnidirectional Range navigation system
VORTAC VHF Omnidirectional Range /TACtical Air Navigation
VSCS Voice Switching Communication System
WAAS World Aircraft Accident Summary


Chapter 1 Introduction


1 Introduction

The aim of this Chapter is to present the background to the problem of controller

recovery from equipment failures in Air Traffic Control (ATC) and to set the scene for

the research presented in this thesis. This Chapter defines the rationale behind the

need to better understand the impact that equipment failures have on controller

performance in the current as well as in the future ATC environment. Based on this

background, the principal research objectives are defined to ensure an in-depth

analysis of ATC equipment failures and controller recovery. This is followed by the

specification of the structure of the thesis and a summary of each Chapter.

1.1 Background to the problem

The aim of the research presented in this thesis is to provide a holistic assessment of

controller recovery from equipment failures in ATC. In order to achieve this, it is

essential to define the environment in which equipment failures are investigated, i.e.

the Air Traffic Management (ATM) system and its ATC component. While ATC is

responsible for the separation of air traffic, other components of the ATM system

manage air traffic flow and airspace design to assure minimal delays and optimal use

of airspace. The ATC system comprises people, equipment, and procedures

required to act together to achieve the same objective, i.e. safe and efficient flow of air

traffic in a dedicated airspace. In order to achieve this, all three components must be

operational and fully integrated to enable the most effective and efficient air traffic

service. Consequently, in the case of failure of any component of the ATC system, the

remaining nominally operational components may still provide air traffic services, either

partially or fully, depending on the characteristics of the failure. The research presented

in this thesis focuses solely on failures of one component of the ATC system, namely

equipment.

In order to provide continuous air traffic services, various ‘defences’ or ‘barriers’ are

designed to prevent or mitigate the occurrence of equipment failures. For example, the

existence of technical built-in defences offers protection against the majority of



equipment failures that can occur (NATS, 2002). In most cases, this protection is

triggered automatically and seamlessly. Hence, an equipment failure should not result

in a problem that affects the controller’s ability to carry out tasks safely, as it

should be automatically resolved with no interruption of the service (EUROCONTROL,

2004e). However, there are occasions when these technical defences are not sufficient

to maintain the normal ATC system state and protect against negative outcomes. On

such occasions, the intervention of the human, as a component of the ATC system, is

necessary. In other words, the intervention of the air traffic controller becomes crucial

for the provision of a safe but not necessarily efficient air traffic service. Note that

safety represents the key driver here as opposed to efficiency.

In the past, major failures or total outages (i.e. failure of the entire system) were the

subject of detailed investigations. These investigations were aimed at resolving and

preventing similar failure occurrences by focusing mostly on the technology (National

Transportation Safety Board, 1996; General Accounting Office, 1982; General

Accounting Office, 1991; General Accounting Office, 1996; and General Accounting

Office, 1998). For a long time, the basic focus of reliability, system safety, and quality

management was purely on the prevention of equipment failures or the reduction of

their reoccurrence. Various techniques have been developed to assess equipment

failures, their causes, consequences, and appropriate defences. For example, the US

Federal Aviation Administration (FAA) requires that the availability of the Voice

Switching Communication System (VSCS) at the level of the ATC Centre

(facility-level¹) should not be less than 0.9999999, including the backup VSCS (FAA, 1997). In

spite of the significant efforts, equipment failures still occur and every ATC system

eventually fails to perform its intended function or part thereof. On these unexpected

occasions, the recovery of the ATC system is left to the human operator to implement

an appropriate recovery strategy in both a timely and effective manner. While past

research focused on the technical aspects of the occurrence of equipment failures,

very little has been done on human factors, with a particular reference to controller

recovery from such failures. Some examples, such as research by Wickens et al.

(1998), Low and Donohoe (2001), and EUROCONTROL (2004e), are discussed in the

following paragraphs.

1 The facility-level availability is based on a 50-position system. According to the FAA, system failure occurs when one or more critical functions are unavailable in more than 10 percent of the positions.
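
To give a sense of scale for the availability requirement quoted above, the short sketch below converts an availability figure into an expected downtime per year and encodes the facility-level failure rule described in footnote 1. It is an illustrative calculation only: the function names, the 365-day year, and the integer percentage threshold are assumptions made for this example and are not taken from FAA (1997).

```python
# Illustrative sketch only; names and the 365-day year are assumptions.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def expected_downtime_seconds(availability: float,
                              period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the expected unavailable time over a period for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie between 0 and 1")
    return (1.0 - availability) * period_seconds


def is_facility_level_failure(affected_positions: int,
                              total_positions: int = 50,
                              threshold_percent: int = 10) -> bool:
    """Apply the rule in footnote 1: failure when critical functions are
    unavailable in MORE than threshold_percent of the positions."""
    if total_positions <= 0 or not 0 <= affected_positions <= total_positions:
        raise ValueError("position counts are inconsistent")
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return affected_positions * 100 > threshold_percent * total_positions


if __name__ == "__main__":
    downtime = expected_downtime_seconds(0.9999999)
    print(f"Expected downtime: {downtime:.2f} seconds per year")
    print("5 of 50 positions affected:", is_facility_level_failure(5))
    print("6 of 50 positions affected:", is_facility_level_failure(6))
```

Under these assumptions, an availability of 0.9999999 corresponds to roughly three seconds of expected unavailability per year, and in a 50-position system the failure criterion is met only once six or more positions are affected.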



There is a vast amount of Human Reliability Assessment (HRA) research on recovery

from human error in areas including the nuclear and chemical process industry.

However, this knowledge has not been fully exploited in aviation. For example, Zapf

and Reason (1994), Kontogiannis (1999), Kanse and van der Schaaf (2000), and

Kanse (2004) analysed recovery from the consequences of human error in various

non-ATC environments. Moreover, past HRA research recognised the importance of

contextual factors that influence the recovery process. Various HRA techniques defined

these factors depending on the type of operation and environment that surrounds the

human operator. In short, the concepts of recovery from human error and recovery

context are transferable to the recovery from equipment failure. Both represent human

recovery triggered by a different stimulus (human error as opposed to technical failure)

occurring within a certain context.

The above findings led to a significant research effort being devoted to the area of

human recovery, from both human error and technical faults. For example, research on

automation in future ATM has shown that human operators are less likely to detect

failures in the automated process due to complacency and reduced situational

awareness (Wickens et al., 1998; Metzger and Parasuraman, 2005). Researchers at

the UK National Air Traffic Service (NATS) examined the potential methodologies to

assess human recovery performance from failures of several automated systems (Low

and Donohoe, 2001). Several different safety (e.g. hazard and operability-HAZOP) and

psycho-physiological methods (e.g. eye movement tracking, situational awareness

assessment-SAGAT, subjective workload ratings-NASA TLX, speech workload) were

investigated. While some of these methods are quite easy to implement (e.g. HAZOP,

SAGAT, NASA TLX), others require complex training and the use of sophisticated

equipment (e.g. eye movement tracking, speech workload). Most of these methods

proved to be appropriate, providing useful information and were thus recommended for

future use. Due to the confidential nature of this research, no further insight was given

into the human recovery process, its phases, and the impact of the context surrounding

the controllers.

Furthermore, the EUROCONTROL Gate to Gate (G2G) project, initiated to test future

advanced ATC concepts, further highlighted the impact and importance of ATC

equipment failures. ATC safety managers throughout Europe highlighted several

equipment related areas of concern within their ATC Centres (Gordon and Makings,

2003). These are: radio communication interference, equipment reliability, ATC tools

failure, and relevance of emergency checklists for controllers and appropriate handling



of emergency situations. This study highlighted the consequences of equipment

unavailability in current as well as future more automated ATC environments.

Simulation trials that followed attempted to identify and investigate safety-relevant

occurrences associated with future ATC concepts/tools (Medium Term Conflict

Detection-MTCD, MONitoring Aid-MONA, data link, Arrival Manager-AMAN, and

Airborne Separation Assistance System-ASAS). Various equipment failures were

identified amongst the potential safety-relevant occurrences². They included

problems with the Human Machine Interface (HMI), ASAS messages, and data link

messages (Damidau, Kirwan, and Scrivani, 2006).

However, few studies have explicitly addressed the combined question of equipment

failures and recovery in the area of ATC. The Panel on Human Factors in Air Traffic

Control Automation was formed at the request of the Federal Aviation Administration

(FAA) to study the air traffic control system, the national airspace system, and future

automation alternatives from a human factors perspective (Wickens et al., 1998). The

Panel’s deliberations, in particular, highlighted the role of reliability of automation and

human recovery in the future ATC environment, characterised by higher levels of

automation, complexity, and traffic density. Similarly, the EUROCONTROL project on

Solutions for Human Automation Partnership in European Air Traffic Management

(SHAPE) dedicated one part to the analysis of human recovery from equipment failures

in the automated ATC environment. The findings highlighted the importance of context

within which a failure occurs as well as recovery training and procedures designed to

aid recovery (EUROCONTROL, 2004e).

Overall, existing research has shown that there is a need to understand the

mechanisms behind failure and recovery in ATC. This applies both to the technical and

human perspectives as both are essential to ensuring the highest level of safety. In

order to develop a heuristic method to address these issues, it is necessary to define

the major research objectives. These are presented below.

1.2 Research objectives

The need for an in-depth analysis of ATC equipment failures and the associated

controller recovery processes is presented briefly above and is discussed in more

2 Personal correspondence with EUROCONTROL G2G project team.



detail in the remainder of the thesis. Based on the background to the problem

presented above, four research objectives have been formulated:

• Provide a systematic literature review to connect disparate but related topics of ATC equipment failures and controller recovery, previously lacking in the area of ATC;

• Identify potential equipment failure types and their characteristics;

• Identify contextual factors that affect controller recovery performance and derive a methodology to quantitatively assess recovery context; and

• Propose a framework for the analysis of controller recovery. This framework should be further verified with a specific reference to a particular equipment failure type.

1.3 Outline of the thesis

This thesis is organised as follows. Chapter 2 discusses the architecture of the Air

Traffic Management (ATM) system with specific attention paid to its Air Traffic Control

(ATC) component, to portray the context of the research presented in this thesis. The

ATC architecture is presented in terms of nine functionalities and the corresponding

physical architecture (equipment). In other words, it specifies nine ATC functionalities

and equipment that supports each of them. Chapter 3 presents a preliminary

assessment of the equipment failures in ATC based on the sample of operational

failure reports available in this research. It provides definitions of equipment failure,

hazards, and built-in technical defences to be used in the research on recovery from

equipment failures in ATC. The Chapter continues by assessing how representative

the sample is of equipment failures occurring in the operational ATC environment. This is

achieved through a methodology that determines how much ATC equipment contributes

to the safety of the overall air transport system.

Having confirmed that the operational failure reports available in this thesis are

representative of the equipment failure types experienced operationally, Chapter 4

provides a good understanding of equipment failures and their impact on the ATM and

ATC operations. It discusses the main equipment failure characteristics extracted from

available operational failure reports and past research. Assessed characteristics range

from the ATC functionality affected to the impact of equipment failure on ATC and ATM

operations. The Chapter concludes with the development of a novel tool for the

assessment of the overall impact of an equipment failure on ATC operations, known as

the qualitative equipment failure impact assessment tool.



Having established the framework for the assessment of equipment failures in

Chapters 3 and 4, Chapter 5 addresses the human factors aspects of relevance to

controller recovery performance in the event of an equipment failure. It discusses past

research on human reliability transferable to controller recovery performance. The

Chapter presents the initial theoretical findings on the recovery process, including the

relevance of the recovery context, past experience, recovery procedures, and recovery

training. It concludes by defining the potential variables that enable the assessment

and understanding of controller recovery performance.

The theoretical findings from Chapter 5 are further informed by the operational

experience extracted from the questionnaire survey results presented in Chapter 6.

This survey informed both the technical and human aspects of the research into

recovery from ATC equipment failures.

Having acknowledged the importance of recovery context both from past research

(Chapter 5) and operational experience (Chapter 6), this thesis continues by setting the

scene for the qualitative and quantitative assessment of the recovery context. Chapter

7 reviews past ATC and non-ATC research to extract the relevant factors important for

the definition of the context surrounding an ATC equipment failure occurrence. As a

result, this Chapter concludes with a set of 20 candidate Recovery Influencing Factors

(RIFs). Chapter 8 reviews relevant past research to further exploit the findings from

Chapter 7. It continues by defining the methodology for the quantitative assessment of

the recovery context and definition of the recovery context indicator.

To further verify the methodology proposed in Chapter 8, Chapter 9 presents the

design of an experiment carried out at a particular ATC Centre that involved exposing

30 operational controllers to an unexpected but complex equipment failure. This

particular equipment failure was carefully selected from several failure types based on

the findings in Chapters 4, 5, and 6. The analyses of the data collected on recovery

performance from this experiment are presented in Chapter 10. These analyses are

based on a set of variables that enable investigation of controller recovery as proposed

in Chapter 5. The thesis ends with Chapter 11 drawing together the conclusions

achieved throughout this research together with suggested areas for further research.

Figure 1-1 crystallises the overall structure of this thesis.



Figure 1-1 Overview of the thesis

Chapter 2 Fundamentals of ATM and ATC


2 Fundamentals of Air Traffic Management and Control

The main objective of the research presented in this thesis is to investigate the

recovery process adopted by air traffic controllers in the event of Air Traffic Control

(ATC) equipment failures. A desirable objective of the research in this thesis is a

framework to analyse controller recovery transferable in time (i.e. to the current and

future ATC Centre). The Chapter contributes to this objective in several ways. Firstly, it

defines the environment for the investigation of equipment failures, i.e. Air Traffic

Management (ATM) and its ATC component. Secondly, it discusses the ATC system

architecture including its specific functional elements. The Chapter proposes a unique

classification of equipment failures based on these functional elements that enables the

capture of all operational components of ATC. This classification is further built upon in

the remainder of the thesis (Chapter 4) to create a qualitative equipment failure impact

assessment tool. Thirdly, the Chapter reviews the characteristics of a generic ATC

Centre with regard to current and future technologies. The potential characteristics of

future ATC Centres are discussed with an emphasis on challenges that face human

operators (i.e. air traffic controllers) due to increasing levels of automation. The

Chapter concludes with discussions on the potential sources of technical and controller

performance deficiencies within future ATC Centres and their relevance to the recovery

process.

2.1 Air Traffic Management

The major components of the air transport system are aircraft, airline operations, ATM,

airport operations, and the operational environment in which these components exist

and interact (Figure 2-1). The objective of ATM is “to enable aircraft operators to meet

their planned times of departure and arrival and adhere to their preferred flight profiles

with minimum constraints, without compromising agreed levels of safety”

(EUROCONTROL, 2006a).



Figure 2-1 Air transport system (from Subotic et al., 2005)

An ATM system comprises two functionally integrated elements, namely airborne ATM

and ground-based ATM. The airborne ATM consists of several systems integrated into

the aircraft cockpit, such as the airborne Communication/Navigation/Surveillance

(CNS) system, the Flight Management System (FMS), and the Airborne Collision

Avoidance System (ACAS) also known as the Traffic Alert and Collision Avoidance

System (TCAS). The components of ground-based ATM (Figure 2-1) are Airspace

Management (ASM), Air Traffic Service (ATS), and Air Traffic Flow Management

(ATFM) (ICAO, 2001a).

Airspace Management (ASM) is related to the structure and organisation of the national

airspace organised at a strategic (i.e. national ASM policy, planning, and coordination),

pre-tactical (i.e. daily management and temporary allocation of airspace), and tactical

levels (i.e. real-time activation, deactivation, reallocation of airspace, and civil/military

coordination). Air Traffic Service (ATS) is a generic term that combines various

services: the Air traffic services Reporting Office (ARO), the Air Traffic Control service

(ATC), and the Flight Information and alerting Service (FIS) (ICAO, 2001a). The ARO is

a unit established for the purpose of receiving reports concerning air traffic services

and flight plans submitted before flight departure. The ATC component of ATS provides

control of all air traffic in a dedicated airspace. This is discussed in detail in section 2.2

given its importance to the research presented in this thesis. The Flight Information and

alerting Service (FIS) gives advice and information useful for the safe and efficient

conduct of flights. The alerting service provides search and rescue assistance to

aircraft in distress and coordinates any action that may be required. Finally, Air Traffic

Flow Management (ATFM) is a service established to ensure that ATC capacity is



utilised to the maximum extent possible, and that the traffic volumes are compatible

with the capacities declared by the appropriate authority. Optimal flow of traffic is

achieved by continuously balancing the traffic demand and the ability of ATC to

accommodate that demand.

2.2 Air Traffic Control

The research presented in this thesis is focused specifically on controller recovery from

equipment failures in Air Traffic Control (ATC). Therefore, this section focuses on the

main characteristics of ATC and the different services provided. Modern ATC services

are provided from ATC Centres by controllers and supporting staff (engineers,

managers, and administrators), working together to achieve the same objective. The

primary objective of an ATC service is to provide a safe flow of traffic both in the air and

on the ground (EUROCONTROL, 1999). In other words, the primary function is to

prevent collision between aircraft in the air as well as collision between aircraft and any

obstacles on the manoeuvring area, by providing and maintaining the required lateral

and vertical separations. The secondary functions of an ATC service include ensuring

orderly and expeditious traffic flow by providing traffic advisories, such as weather

information and navigation directions (i.e. vectors). To achieve these functions, the

service is divided into sections that provide an ATC service to aircraft depending on the

segment of the flight profile, i.e. phase of flight (Figure 2-2). According to the

International Civil Aviation Organisation (ICAO)¹, ATC provides area, approach, and

aerodrome control services. These are discussed in the following sections.

Figure 2-2 Flight profile (adapted from ICAO, 2001b)

1 ICAO is the specialised agency of the United Nations concerned with the development of air

navigation and regulation of international air transport.



2.2.1 Area control service

The area control service is provided from an Area Control Centre (ACC), as defined by

ICAO. In the US, such a Centre is referred to as an Air Route Traffic Control Centre

(ARTCC) as defined by the US Federal Aviation Administration (FAA). The controllers

at ACCs provide instructions, clearances, and advice regarding flight conditions during

the cruise phase of the flight (see Figure 2-2). The controllers provide separation

between aircraft operating in the complex network of airways (predetermined air

routes). The controllers use radar to monitor the progress of flights and intervene when

the route or flight level of an aircraft brings it into conflict with another. This is achieved

through tactical air traffic control interventions such as heading or track change, flight

level change, speed control, or alteration of flight routes. In areas where it is impossible

to provide a radar service (i.e. oceanic airspace and other regions without radar

coverage), the controllers employ procedural (i.e. non-radar) control to ensure that

adequate separation exists between aircraft. Procedural control employs greater

separation standards because of the absence of direct radar surveillance (Nolan, 1998;

EUROCONTROL, 1999).

An ACC is usually sub-divided into controlled airspace sectors² that have responsibility

for specific portions of airspace. This is a direct result of the large volumes of air traffic

that utilise the airspace in the cruise phase of the flight. The greater airspace is

sectorised into smaller, more manageable parts in an effort to prevent controller

overload (i.e. when the traffic in a sector exceeds available airspace capacity or a

controller is unable to safely control existing levels of air traffic).

Generally, each ATC sector is manned by an executive controller and a planning controller,

each with clearly defined roles and responsibilities (EUROCONTROL, 1999). In the

case of high traffic complexity, the two sector controllers are supported by a third person,

i.e. an assistant or a flight data controller. The executive controller is responsible for the

correct identification of traffic within the sector’s area of responsibility and for the

control of all aircraft to ensure a safe, orderly, and expeditious flow of air traffic.

Additionally, the executive controller is required to assist pilots by providing required

navigation assistance and to assist aircraft in any emergency situation. The planning

controller assists the executive controller to the fullest extent by identifying traffic in

potential conflict, managing flight progress strips, and planning the flow of traffic within

the sector. In addition, the planning controller has to ensure that traffic enters and

leaves the sector at flight levels and exit points as agreed with the adjacent sectors

(EUROCONTROL, 1999). The assistant or flight data controller ensures that the strip

printer functions properly. In addition, the assistant accepts and processes all received

messages in a timely manner and passes them to the appropriate position, manually

inputting any tracks for which flight progress strips have not been produced.

2 Airspace is organised into adjacent portions, the so-called sectors, controlled by two or three controllers, namely the executive or tactical controller, the planning controller, and the assistant or flight data controller.

The controllers operating in the sectors within an ACC work in close

cooperation and negotiate with each other on behalf of aircraft to optimise efficiency and

ensure safety. The area controller’s responsibility terminates when an aircraft is handed

over to an adjacent ACC or to an approach control office.

2.2.2 Approach control service

The approach control service is provided from the APProach control office or room

(APP), as defined by ICAO, or Terminal Radar Approach CONtrol (TRACON), as

defined by the FAA. According to ICAO (2001a), the approach control unit is

established to provide air traffic control service to controlled flights arriving at, or

departing from, one or more airports. This service is closely associated with the

characteristics of the airports. The radar controllers in the approach control office

provide separation between aircraft in descent during the arrival phase, and, during the

departure phase, between aircraft climbing to their assigned cruise or intermediate

assigned levels (see Figure 2-2). Therefore, the approach controllers are responsible

for providing a safe and expeditious service to departing aircraft in the initial phase of

flight and to arriving aircraft in the descent and final phases of flight (Nolan, 1998;

EUROCONTROL, 1999). The approach controller’s responsibility terminates when a

departing aircraft is handed over to an ACC or when an arriving aircraft has landed. Note

that APP is responsible for monitoring approaching aircraft, even after they are

transferred to the aerodrome control tower, until they land.

2.2.3 Aerodrome control service

The aerodrome control service is provided from the Aerodrome Control Tower (TWR),

as defined by ICAO, or the Air Traffic Control Tower (ATCT), as defined by the FAA. The

aerodrome controllers are responsible for the safe and efficient conduct of flights during

the take-off and landing phases. These controllers direct airport traffic so that it flows

smoothly and expeditiously. Working closely with the approach controller, they ensure

safety of airport operations by restricting traffic movements so that only one aircraft



may land or take-off at a time (Nolan, 1998; EUROCONTROL, 1999). In airports that

use multi-runway operations, the aerodrome controller may be responsible for all

runway operations. Otherwise, the responsibility for multi-runway operations may be

divided between a number of controllers. For example, a parallel runway configuration,

where one runway is dedicated to departures and the other to arrivals, requires

separate departure and arrival controllers. In this case, close cooperation between the

two controllers is essential to ensure a safe operation.

The aerodrome controller is responsible for all traffic operating in the designated area

of responsibility of the control tower. This includes aerodrome circuit traffic, aircraft

landing and taking off, and aircraft and vehicles operating on the manoeuvring areas

(ICAO, 2001a). When good visibility conditions prevail (i.e. visual meteorological

conditions or VMC), the controller may separate the traffic by visual means and a

reduction in standard separation is permissible. When poor visibility conditions prevail

(i.e. instrument meteorological conditions or IMC) the aerodrome controller works in

close cooperation with the approach controller. In such conditions, prescribed

separation standards must be applied between aircraft in the air.

The surface movement control or ground control (in the US) is a supplementary service

to the aerodrome control service. In less busy airports the aerodrome and surface

movement control functions can be combined and provided by the aerodrome

controller. Otherwise, the surface controller is responsible for issuing taxi clearance

which will take all aircraft to the departure end of the runway (Nolan, 1998;

EUROCONTROL, 1999). In addition, the surface controller is responsible for the

movements of all aircraft and vehicular traffic on the manoeuvring areas of the airport.

ICAO (2001a) defines the manoeuvring areas as any part of the airport used for the

takeoff, landing, and taxiing of aircraft, excluding aprons. Surface movement control is

usually undertaken by visual means. However, in conditions of poor visibility the

controller relies upon surface movement radar (SMR). Working in close cooperation

with the aerodrome controller, the surface controller ensures that all active runways are

free from vehicular activity during aircraft movements.

2.3 Overall Air Traffic Control system architecture

The preceding paragraphs have highlighted the complexity of the ATM system and its

further decomposition down to the ATC system. Additionally, Figure 2-3 presents ATC

as a system comprised of people, equipment, and procedures integrated in an optimal

way to achieve a common objective. In order to understand how these components



come together, a more detailed explanation of the ATC architecture and its basic

functionalities is given below. In line with the objectives of the research presented in

this thesis, this section provides a deeper understanding of ATC functionalities and the

types of ATC equipment that can fail, and therefore affect controller recovery.

[Figure 2-3 diagram. Labels: ATM; Airspace Management (ASM); Air Traffic Flow Management (ATFM); Airborne ATM (e.g. airborne CNS, FMC, ACAS/TCAS); Ground-based ATM; Air Traffic Services (ATS); Air Traffic Control (ATC); Air Traffic Services Reporting Office (ARO); Flight Information Service (FIS). ATC components: PEOPLE (controllers, engineers, management); EQUIPMENT (HMI, hardware, software); PROCEDURES & TRAINING (operational procedures, engineering procedures).]

Figure 2-3 ATM and ATC system components (adapted from ICAO, 2001a)

The functional architecture of any system presents a high level decomposition of the

overall system into a logical set of functional blocks. Each block may be further

decomposed into a series of sub-functions. The ATC functionalities and their related

sub-functions, as presented in this thesis, include all those of the current ATM/ATC

system as well as those under development for inclusion in the future (i.e. with 2020 taken

as the target year in this thesis in line with the European Commission’s ‘Vision 2020’;

European Commission, 2001).

The starting point for the development of the ATC functional classification in this thesis

is the EUROCONTROL Harmonisation of European Incident Definition Initiative for

ATM (HEIDI) taxonomy. The HEIDI taxonomy identifies six different ATC functionalities and

related ATC equipment that supports each of them. The functionalities listed in HEIDI

are: communication, surveillance, navigation, data processing and distribution, support

information functionality and power supply (EUROCONTROL, 2001e). This taxonomy

is subsequently expanded in this thesis by taking into account the needs for both the

classification and characteristics of the information derived from operational failure

reports processed. The analysis of operational failure reports highlighted the need for

nine ATC functional blocks. The next set of layers dissects each ATC functional block



into relevant sub-functions which are then dissected further to the elemental level. This

approach enables the capture of all operational components of ATC. The resulting nine

ATC functional blocks, as defined in this thesis, are:

• Communication;

• Navigation;

• Surveillance;

• Data processing and distribution;

• Supporting;

• Safety nets;

• Power supply;

• Pointing and data input; and

• System monitoring and control.

Additionally, this classification is further built upon in Chapter 4. The following

paragraphs give a detailed description of each functionality and the corresponding

physical components (i.e. hardware components that support each function).

2.3.1 Air Traffic Control functionalities

2.3.1.1 Communication function

The scope of the communication function covers the distribution of information to air- and

ground-based ATC system components in the form of voice, data, or both. This is

achieved using various communication methods. Currently, radio telephony (RT)

enables voice transfer of information via high frequencies (HF), very high frequencies

(VHF), and ultra-high frequencies (UHF). Controller-pilot data link communication

(CPDLC), as a concept currently used in Australasia and the Pacific, assumes transfer

of data based on high frequency data link (HF DL), very high frequency data link (VDL),

and satellite communication (SATCOM). In general, the communication function

provides connectivity and information transfer between users and providers that are

both internal and external to a particular ATC Centre. This function is supported by

various components (Figure 2-4) which are discussed in the following paragraphs. The

section concludes with a discussion of the future communication systems and the

concept of Required Communication Performance (RCP).



Figure 2-4 Communication function

Firstly, the communication function is supported by a Voice Switching Communication

System (VSCS) presented on Controller Working Positions (CWPs) via the VSCS

panel. This is a computer-controlled switching system that facilitates both the air-to-

ground (A/G) and ground-ground (G/G) communication necessary for ATC operations

(FAA, 1998). Controllers are able to use the VSCS for A/G communication by

accessing A/G transmitters and receivers through which they communicate with pilots

via HF, VHF, or UHF. The VSCS also ensures that incoming A/G communications from

pilots are routed to the appropriate control position. Controllers are able to use the

VSCS for G/G communication via intercom, interphone, and external circuits. Intercom

enables controllers to access other control positions or ancillary positions located within

the operational room. Interphone enables controllers to access positions located within

another ATC/ATM facility. Finally, external circuits of VSCS enable controllers to

access the public telephone network (FAA, 1998).

Secondly, data is exchanged with adjacent ATC Centres via the Aeronautical Fixed

Telecommunication Network (AFTN), On-line Data Exchange (OLDI) automated

protocols, and ICAO data interchange network, using both public and private telephone

networks. AFTN, administered by ICAO, is the means by which all information

concerning national and international air operations is exchanged. The data consists

of messages on aircraft movements, conditions of airports, weather, and other

information related to ATC. OLDI refers to operational use of connections between

various Flight Data Processing Systems (FDPS) at different Area Control Centres

(ACCs). Public and private telephone networks are used to communicate data on

individual flights between ATC Centres along the route of the flight. The data that is



exchanged includes flight level information, airspace boundary estimates of flights, and

other conditions that may be agreed between ATC Centres. This category incorporates

both systems for data exchange and any supporting equipment (e.g. AFTN printer,

console).

Thirdly, the Aeronautical Information System (AIS) provides information of a permanent

or semi-permanent nature on subjects such as geographical description of airspace, in-

flight procedures, sector procedures, communications data, surveillance data, and

specific airport characteristics data, either verbally or via datalink. In addition, local ATC

units provide a dynamic broadcast of relevant information to arriving and departing

pilots in the vicinity of the airport, known as the Automatic Terminal Information Service

(ATIS). This service uses local weather data (from the meteorological office) and AIS

data (e.g. runway and taxiway conditions, navigational aids status).

Fourthly, backup radio and telephone systems must be provided. These backup

systems may provide identical functionality if the backup is a duplicated VSCS. However,

in some cases, redundancy can be provided by similar but not identical systems which

cannot offer identical functionality. In these cases it is essential that controllers are

aware of these differences. Backup communication systems must be capable of

providing continuity of communication during outages (complete loss of the

communications at the level of an ATC Centre), as voice communication continues to

be the primary means of communicating ATC instructions to aircraft.

Finally, several other physical components are listed which have a role in providing the

overall communications function. These include but are not limited to pagers, headsets,

handsets, microphones, processors, press-to-talk buttons (PTT), buzzers, cables, and

footswitches.

The previous discussion has focused on current systems that support the

communication function. Current communication methods are mostly based on

analogue voice communication, which poses various limitations for users (e.g. limited

coverage, accessibility, capability, integrity, and security). Moreover, the combination of

these limitations with current Radio Telephony (RT) procedures is linked to excessive

levels of controller workload (see Figure 21 in EUROCONTROL, 2004g). As a result,

future development of air navigation for civil aviation aims toward enhanced

communication links between aircraft and controllers. This was an important element of

ICAO’s Future Air Navigation Systems (FANS) concept (ICAO, 2007). With respect to



communication, a major development has been the advent of the Required

Communications Performance (RCP) concept. This concept characterises the

performance requirements for communications with no specific reference to

technology. Hence, the concept allows various technologies to be evaluated in terms of

communication process time (i.e. delay), integrity, availability, and continuity of function

(NASA, 2000). Until 2015, it is anticipated that the voice communication function will be

supported by a very high frequency data link (VDL) in addition to existing analogue

voice channels. In general, voice communication will be used for real-time, time-critical,

and non-routine messages (e.g. radar vectoring to avoid traffic). All other, more routine

communications will be served via data communication supported by VDL and satellite

communication (SATCOM) (NASA, 2000). The use of enhanced modes of data link will

enable several advanced features. Firstly, it will bring automatic data entry capabilities

while reducing time spent on manual data entry and potential for data entry errors.

Secondly, it will permit a significant reduction in transmission time and thus reduce RT

frequency congestion. Finally, it will eliminate misunderstandings as a result of

broadcasting problems and language issues. As a result, communication in the 2020

time frame is expected to be characterised by a mix of analogue voice and digital

communication with increased use of datalink to complement or replace existing

analogue voice communications.

2.3.1.2 Navigation function

The main objective of the navigation function within air traffic control (ATC) is to provide

aircraft with the means to navigate between the point of departure and the point of

arrival, i.e. to accurately and reliably determine their position during all phases of flight.

The quality of required navigational information (e.g. accuracy and integrity of aircraft

position) differs based upon the phase of flight. For example, the requirements in the

landing phase of the flight are the most stringent due to proximity to the ground and

the high speed of the aircraft, leaving the pilot little time to take corrective action. The navigation

function block, as shown in Figure 2-5, focuses on three components, namely

approach and landing navigation systems, area navigation systems, and systems for

control and monitoring of ground-based airport facilities. These are explained in the

following sections, concluding with a discussion of the concept of Required Navigation

Performance (RNP).



Figure 2-5 Navigational function

2.3.1.2.1 Approach and landing navigation

This category within the navigation function consists of the systems that provide

precise guidance to an aircraft approaching a runway. The most widespread approach

aid is the Instrument Landing System (ILS) used for the most critical phases of the

flight, i.e. approach and landing. This system provides the pilot with both runway

centreline azimuth guidance (provided by an ILS localiser) and descent rate guidance

(provided by ILS glide slope) along the approach path of an aircraft. It allows pilots to

conduct the final approach and land safely even in conditions of poor visibility.

Previously, a Microwave Landing System (MLS) was supported by ICAO in areas

where it offered operational and economic advantages (e.g. increased runway

throughput/capacity). However, in this domain much more emphasis is now put on

evaluation of satellite navigation techniques and the necessary augmentations to

support precision landing with the long term objective of replacing the ILS system

(Aviation International News, 2001).

2.3.1.2.2 Area navigation

aRea NAVigation (RNAV) is a method of navigation that enables aircraft to fly any

chosen direct course within a network of navigation beacons, rather than navigating

directly to and from the individual beacons (EUROCONTROL, 2003h). Navigation

systems which provide RNAV capability include VHF Omni-directional Range/ Distance



Measuring Equipment (VOR/DME), DME/DME, Non-Directional Beacon (NDB), self-

contained Inertial Navigation Systems (INS), and Global Positioning System (GPS).

Currently, area navigation is primarily supported by ground-based systems. Most

widespread is the VOR which provides a radial or bearing on which aircraft fly from one

VOR station to another (EUROCONTROL, 2003g). This aid is usually combined with

DME providing information on the distance of the aircraft from the VOR/DME beacon.

Therefore, any aircraft utilising this facility can determine its position in terms of

bearing and distance relative to the location of the VOR station. The VOR/DME

combination represents the primary ground based aid for area navigation. Generally,

the maximum range of VOR stations is in the region of 250 NM due to the line-of-sight

nature of VHF signals and the curvature of the Earth (EUROCONTROL, 2003g). Each

air navigational service provider publishes the effective range of their VOR stations.
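The bearing-and-distance fix and the line-of-sight limit described above can be illustrated with a short sketch. It is a simplified illustration rather than an operational algorithm: it assumes a spherical Earth, treats the VOR radial as a true (not magnetic) bearing, treats the DME reading as ground distance rather than slant range, and uses the common 1.23 × √h rule of thumb for the radio horizon (heights in feet, range in nautical miles). The function names and constants are assumptions introduced here for illustration only.

```python
import math

# Mean Earth radius in nautical miles (assumption: spherical Earth).
EARTH_RADIUS_NM = 3440.065


def vor_dme_fix(station_lat_deg, station_lon_deg, radial_deg, distance_nm):
    """Estimate aircraft latitude/longitude from a VOR radial and a DME distance.

    Assumptions: the radial is a true bearing from the station, and the DME
    distance is treated as distance along the Earth's surface.
    """
    lat1 = math.radians(station_lat_deg)
    lon1 = math.radians(station_lon_deg)
    theta = math.radians(radial_deg)          # bearing from station to aircraft
    delta = distance_nm / EARTH_RADIUS_NM     # angular distance in radians

    # Standard great-circle "destination point" formula.
    lat2 = math.asin(math.sin(lat1) * math.cos(delta)
                     + math.cos(lat1) * math.sin(delta) * math.cos(theta))
    lon2 = lon1 + math.atan2(math.sin(theta) * math.sin(delta) * math.cos(lat1),
                             math.cos(delta) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lat2), math.degrees(lon2)


def radio_horizon_nm(station_height_ft, aircraft_height_ft):
    """Approximate VHF line-of-sight range in NM (rule-of-thumb factor 1.23)."""
    return 1.23 * (math.sqrt(station_height_ft) + math.sqrt(aircraft_height_ft))
```

Under these assumptions, an aircraft at 40,000 ft and a station at ground level give a line-of-sight range of about 246 NM, which is of the same order as the maximum range quoted above.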

Another system that uses a radio beacon is the NDB. It consists of two components: the

Automatic Direction Finder (ADF), which is the airborne component, and the

NDB transmitting unit, which is the ground component. The NDB broadcasts

continuously on a specific frequency. An ADF on the aircraft detects the bearing to

or from an NDB and thus determines the aircraft's position relative to the beacon. An

NDB bearing is a line passing through the station that points in a specific direction (e.g.

270 degrees, i.e. due west). This system may also be coupled with a DME. Although widely

used in the approach environment, it is less accurate and less reliable than VOR/DME

since it is susceptible to interference from thunderstorms and other atmospheric

phenomena. The power output determines the maximum range of an NDB, but

NDBs are generally usable within a range of 50-100 NM (EUROCONTROL, 2003g).

An INS is a completely self-contained navigational system located on board the aircraft

and independent of ground-based navigation aids. The basic INS consists of three

mutually orthogonal gyroscopes, three mutually orthogonal accelerometers, a

navigation computer, and a clock (EUROCONTROL, 2003g). Gyroscopes are

instruments that provide the orientation of an object (e.g. aircraft’s angles of roll, pitch,

and yaw). Accelerometers sense acceleration (the rate of change of velocity) along a given

axis. The orthogonal accelerometer configuration provides three orthogonal

acceleration components. Combination of the gyroscope orientation information with

the summed accelerometer outputs yields the total acceleration in three-dimensional

airspace. A navigation computer then time integrates the total acceleration to get the

aircraft's velocity vector. This velocity vector is further time integrated, yielding the



position vector of the aircraft. These steps are continuously iterated throughout the

duration of the flight. Based on all of the data, the INS determines the aircraft’s

position relative to a known point of departure (i.e. latitude and longitude coordinates of

the departure gate).
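The double integration described above can be sketched in a few lines. This is a deliberately reduced illustration, not a description of any real INS: it works in two dimensions only, uses the heading alone as the gyroscope-derived orientation, assumes the aircraft starts at rest, applies simple Euler integration, and ignores gravity compensation, Earth rotation, and sensor errors. The function name and input format are hypothetical.

```python
import math


def ins_dead_reckoning(samples, dt, start_north_m=0.0, start_east_m=0.0):
    """Integrate body-frame accelerations twice to track horizontal position.

    samples: iterable of (heading_deg, forward_acc, right_acc) tuples, where
             heading is clockwise from north and accelerations are in m/s^2.
    dt:      time step between samples in seconds.
    Returns ((north_m, east_m), (v_north, v_east)).
    """
    north, east = start_north_m, start_east_m
    v_north, v_east = 0.0, 0.0  # assumption: aircraft starts at rest

    for heading_deg, a_fwd, a_right in samples:
        h = math.radians(heading_deg)
        # Rotate body-frame acceleration into the north/east frame
        # using the orientation supplied by the gyroscopes.
        a_north = a_fwd * math.cos(h) - a_right * math.sin(h)
        a_east = a_fwd * math.sin(h) + a_right * math.cos(h)
        # First integration: acceleration -> velocity.
        v_north += a_north * dt
        v_east += a_east * dt
        # Second integration: velocity -> position.
        north += v_north * dt
        east += v_east * dt

    return (north, east), (v_north, v_east)
```

Because each position is built on all previous measurements, any sensor error accumulates over time in a scheme of this kind, which is one reason the position is always expressed relative to a known starting point.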

In recent years, Global Navigation Satellite Systems (GNSS) have been gradually

introduced where appropriate and cost effective. Two GNSS systems are currently in

operation: the United States GPS and the Russian Federation’s GLObal NAvigation

Satellite System (GLONASS)3. A third, the European Galileo system, is scheduled to

become operational in 2010. Each of the GNSS systems uses a constellation of

orbiting satellites working in conjunction with a network of ground stations. The GPS

system is available for civil use based on 24 operational satellites. Two distinctive GPS

services are available, namely the Standard Positioning Service (SPS) and the more

accurate Precise Positioning Service (PPS). The SPS is available to the civil users

worldwide without charge or restriction, while the PPS is available primarily to the

military. The SPS requirements are defined through the service availability standard of

more than 99% of time at an average location, with an average accuracy of 34m

horizontal and 77m vertical (95% threshold) (Department of Defence, 2001; European

Commission, 2006a). Similar standards are defined for the Galileo system, where five

distinctive navigation services will be available namely Open Service (OS), Safety-of-

Life service (SoL), Commercial Service (CS), Public Regulated Service (PRS), and

Search And Rescue service (SAR) (European Commission, 2006b). The SoL service is

intended primarily for aircraft navigation. Service performance requirements for SoL

with dual frequency correction are set to be 4m horizontally and 8m vertically (95%

threshold) (European Commission, 2006b).

In recent years, in addition to the concept and supporting systems for area navigation,

a new concept referred to as Precision aRea NAVigation (PRNAV) has emerged.

PRNAV has been introduced to allow consistent terminal airspace operations in the

European region (i.e. European Civil Aviation Conference – ECAC member states).

This is based on the navigation requirements that procedures, design principles, and

aircraft capabilities should meet the accuracy of ±1 NM for at least 95% of the flight

time (EUROCONTROL, 2006b).
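The 95% accuracy requirement can be read as a simple containment check, sketched below. The sketch assumes that track-keeping errors are sampled at equal time intervals, so that the fraction of samples within the limit approximates the fraction of flight time; the function name and default parameters are illustrative assumptions and do not come from the PRNAV specification.

```python
def prnav_containment(errors_nm, limit_nm=1.0, required_fraction=0.95):
    """Return (fraction_within_limit, meets_requirement).

    errors_nm: track-keeping errors in nautical miles, assumed to be
               sampled at equal time intervals throughout the flight.
    """
    if not errors_nm:
        raise ValueError("at least one error sample is required")
    # Count samples whose absolute error is within the +/- limit.
    within = sum(1 for e in errors_nm if abs(e) <= limit_nm)
    fraction = within / len(errors_nm)
    return fraction, fraction >= required_fraction
```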

3 ГЛОбальная НАвигационная Спутниковая Система (ГЛОНАСС) or Global'naya

Navigatsionnaya Sputnikovaya Sistema.



2.3.1.2.3 Systems for control and monitoring of ground-based airport facilities

In addition to all systems previously discussed, the navigation functional block also

includes systems for monitoring and control of ground-based airport facilities. Typically,

monitoring and control of ground-based airport facilities is physically provided via a

control desk with an interface panel designed to represent the airport facilities and

lighting services at a suitable scale (EUROCONTROL, 2003a). This component of the

navigation functional block supports but is not limited to the following elements:

navigational aids status, Aeronautical Ground Lighting (AGL) system (e.g. status of

runway, taxiway lighting panel), warning systems (e.g. runway in use), internal lighting,

meteorological equipment status, and alarming and reporting systems.

Finally, future development of air navigation for civil aviation aims toward enabling

aircraft navigation in four-dimensions seamlessly and gate-to-gate. The post FANS

Required Navigation Performance (RNP) concept is intended to characterise airspace

through a statement of the navigation performance accuracy (RNP type) to be

achieved (Jeppesen, 2001). In addition, the RNP-RNAV concept has emerged to

overcome the lack of harmonisation between the different RNP/RNAV naming

conventions and to enable common understanding of the relationship between RNP

and RNAV system functionality (ICAO, 2006a). The enhanced navigation, landing, and

surface movement service will be predominantly provided by the satellite-based

systems including the various augmentations such as Satellite-Based Augmentation

Systems (SBAS) and Ground-Based Augmentation Systems (GBAS). Surface

movements in all weather operations will be assisted with enhanced vision systems

enabling aircraft to ‘see’ the airport surface in reduced visibility conditions. As a result,

navigation in the 2020 time frame is expected to be characterised by a mix of ground-

and satellite-based systems with increased functionality complementing or replacing

the existing ground-based systems (VOR, NDB, DME).

2.3.1.3 Surveillance function

The ATC surveillance function identifies all aircraft and presents their position on a

radar screen. Additional dynamic information on the aircraft is also provided depending

on the type of radar employed. The surveillance function block, as shown in Figure 2-6,

focuses on radars, radar and auxiliary display, and radars used predominantly for the



terminal and ground surveillance4. The section concludes with a discussion of the

concept of Required Surveillance Performance (RSP).

[Figure 2-6 diagram. Labels: Surveillance; Primary Radar; SSR Mode A/C/S; Automatic Dependent Surveillance (ADS); Surface Movement Radar; Parallel Approach Runway Monitor; Terminal Approach Radar; Precision Approach Radar; Aerodrome Traffic Monitor; Display; Aux Display.]

Figure 2-6 Surveillance function

2.3.1.3.1 Radar systems

Basically there are two types of radar. The Primary Surveillance Radar (PSR) is the

most basic form of radar which transmits a pulsed beam of ultrahigh frequency radio

waves through 360 degrees via a rotating radar head (EUROCONTROL, 1999). When

the waves reach the aircraft, some of the energy is reflected back. Every time the

aircraft reflects the transmitted energy it will be displayed on the radar screen, thus

plotting the course of the aircraft. The PSR only displays an aircraft track or course and

does not provide any other dynamic flight data. This form of radar is rarely used for

commercial aviation except in underdeveloped regions or as a back up to secondary

surveillance radar.

4 The primary difference between enroute radars and those used in the terminal and ground surveillance is the rate of radar information update (e.g. enroute radars update every 8s, whilst terminal radars update every 5s; EUROCONTROL, 1997).

Secondary surveillance radar (SSR) is a more sophisticated form of radar which does

not rely on reflected radio waves. SSR transmits electromagnetic waves in the form of

pulses through 360 degrees (EUROCONTROL, 1999). These pulses are received by

equipment on board the aircraft known as a transponder. The radar pulses interrogate

the transponder and if the transponder recognises the pulses it will respond by

transmitting back to the radar. Recognition is achieved by a discrete four digit code

assigned by ATC. When the transponder transmits to the radar, it actually transmits

essential data about the flight such as aircraft identification (known as Mode A) and

altitude (known as Mode C). As a result, the combination of the PSR and SSR Modes

A and C or SSR alone provides a three dimensional representation of the traffic. In

addition to this information, Mode S possesses a data link functionality and access to

aircraft state vector (ground speed, track angle, turn rate, roll angle, climb rate,

magnetic heading, indicated air speed, mach number) as well as aircraft intent

information or indication of the future path (UK CAA, 2004).
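The format of the four-digit code mentioned above can be illustrated briefly. Each digit of a Mode A code is octal (0 to 7), which gives 4096 possible codes. The function names below are hypothetical and serve only to illustrate the code format, not the interface of any particular ATC system.

```python
def is_valid_mode_a_code(code):
    """Return True if code is a four-character string of octal digits (0-7)."""
    return len(code) == 4 and all(ch in "01234567" for ch in code)


def count_mode_a_codes():
    """Number of distinct four-digit octal codes: 8 ** 4."""
    return 8 ** 4
```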

A new surveillance initiative is directed toward the development of Automatic

Dependent Surveillance Broadcast (ADS-B) technology. This is a satellite-based

surveillance system in which the aircraft determines, primarily from navigation satellites, its

position, altitude, velocity, and other parameters (CASA, 2006). The data is broadcast

to all possible recipients in contrast to Automatic Dependent Surveillance Contract

(ADS-C), where only point to point data transfer is established. As a result, surveillance

in the 2020 time frame is expected to be characterised by a mix of airborne (ADS,

ADS-B, ADS-C) and ground-based functions with increased functionality

complementing or replacing the existing ground-based systems (PSR and SSR).

2.3.1.3.2 Radar and auxiliary display

All surveillance information is presented to controllers on the Human Machine Interface

(HMI) commonly known as air situational display or radar display. Therefore, this

component of surveillance function block includes both radar and auxiliary displays.

Auxiliary display acts as a support providing data such as flight plan data, traffic lists,

and static and dynamic aeronautical data (e.g. notification to airmen - NOTAMs,

meteorological messages, and airport related information).

2.3.1.3.3 Terminal and ground surveillance

The surveillance functional block also incorporates radar systems which are relevant to

terminal and ground surveillance (Figure 2-6). These are Surface Movement Radar

(SMR), Parallel Approach Runway Monitor (PARM), Terminal Approach Radar (TAR),

Precision Approach Radar (PAR), and Aerodrome Traffic Monitor (ATM).



Finally, future development of air navigation for civil aviation is focused on increased

accuracy of the aircraft position by integrating data from all available sources, such as

primary and secondary surveillance signals and Automatic Dependent Surveillance

Broadcast (ADS-B) (Mohleji, Lacher, and Ostwald, 2003). The Required Surveillance

Performance (RSP) defines the surveillance requirements according to the airspace

involved (e.g. oceanic/remote airspace vs. high density traffic airspace). In addition, the

ADS system will enable merging of communications, navigation, and surveillance

technologies. This will accelerate the movement toward Airborne Surveillance and

Separation Assurance (ASAS). In other words, the future surveillance technologies

(e.g. ADS) will enable pilots to participate actively in the process of safely separating

their flight from other flights. This will be achieved by the display of traffic information

within the cockpit, wake vortex hazard prediction and avoidance, three dimensional

terrain presentation, terrain avoidance system, and weather awareness (Ochieng,

2006). Moreover, the US FAA is developing a concept of Situational Awareness for

Safety (SAS). The SAS concept is based on the use of available data (e.g. satellite-

based position data, terrain, weather) and their exchange between all parties involved

(e.g. pilots, dispatchers, controllers). The primary objective of the SAS concept is to

create an environment promoting more efficient, safe, and free use of airspace (FAA,

1995).

2.3.1.4 Data processing and distribution function

The data processing and distribution function incorporates all systems required to

process flight related data (e.g. initial flight plan data, dynamic communication,

navigation, and surveillance flight data). These include the Flight Data Processing

System (FDPS) as well as the Radar Data Processing System (RDPS) enabling

controllers to 'see' in real-time the movement of aircraft in a dedicated airspace, as

represented on the radar display. In addition, this function block also incorporates all

supporting equipment, such as the strip printer (Figure 2-7).



Figure 2-7 Data processing and distribution function (diagram labels: supporting equipment; Flight Data Processing System; Radar Data Processing System; single radar processing; multiple radar processing; Fallback Radar Data Processing System; Fallback Flight Data Processing System; flight plan processing; airspace data processing; flight data management & distribution; SSR management; MTCD; trajectory prediction; MAESTRO)

The FDPS handles flight plans and updates them through automatic events, manual

inputs, and triggered transitions from one state to another. Each state in the life of a flight plan

represents the condition of the flight plan at a specific time in its cycle. The phase of

the flight plan life cycle triggers certain system actions and directly affects what actions

the controller can take on the flight plan and therefore the actual flight. Through the

processing of flight progress strips (either manually or electronically), the controller

manages all traffic by interacting with flight related data (on the radar and auxiliary

display, and strip management board). The FDPS carries out the following specific

processes (EUROCONTROL, 2003a):

• initial flight plan processing which includes checking incoming flight plan

messages, creating a record of flight data, and storing it in the flight plan

database. In addition, the FDPS handles flight data throughout the ‘life’ of the

flight plan by constantly updating and distributing the flight data;

• airspace data processing and distribution which handles the complete airspace

information (e.g. airways and navigation beacons). In addition, it processes any

information on the special use of airspace to warn the controller about

infringements which require modification of flight trajectory;

• meteorological data processing and distribution;

• SSR code management which involves the assignment of SSR codes to flights

and identification of all flights by SSR mode A. It also prevents assignment of

duplicate codes;

• trajectory prediction which is performed throughout the flight plan life cycle, taking

into account the initial flight plan as well as all modifications of the route;



• provision of system supported coordination and transfer of control within the ATC

Centre and between adjacent ATC Centres;

• processing of data link messages from/to the aircraft (A/G coordination);

• flight plan conflict detection which is performed inside a defined region (i.e.

sector) using flight plan data. This function is known as Medium Term Conflict

Detection (MTCD);

• workload monitoring and distribution essential for assisting the supervisor in the

adjustment of the existing sectorisation (i.e. collapse/de-collapse of sectors) and

computation of position/sector load;

• arrival sequencing which provides the approach and en-route controllers with a

proposed sequence number for each arrival flight; and

• establishment of code/callsign correlation as a mapping between radar tracks

and flight plan database.
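Two of the FDPS processes listed above, SSR code management and code/callsign correlation, can be illustrated with a minimal sketch. It is not taken from any operational FDPS: the class name, the method names, and the code range used are hypothetical, and the only properties illustrated are that no two flights receive the same Mode A code and that a radar track's code can be mapped back to a flight plan callsign.

```python
from collections import deque


class SsrCodeAllocator:
    """Illustrative allocator of unique four-digit octal Mode A codes (hypothetical design)."""

    def __init__(self, first=0o0001, last=0o7777):
        # Assumed block of codes available to the centre; real allocations differ.
        self._free = deque(range(first, last + 1))
        self._code_by_callsign = {}
        self._callsign_by_code = {}

    def assign(self, callsign):
        """Assign a code to a flight; a flight that already holds a code keeps it."""
        if callsign in self._code_by_callsign:
            return self._code_by_callsign[callsign]
        if not self._free:
            raise RuntimeError("no free SSR codes")
        code = self._free.popleft()  # each code leaves the pool once, so no duplicates
        self._code_by_callsign[callsign] = code
        self._callsign_by_code[code] = callsign
        return code

    def release(self, callsign):
        """Return a flight's code to the pool once the flight plan is terminated."""
        code = self._code_by_callsign.pop(callsign)
        del self._callsign_by_code[code]
        self._free.append(code)

    def correlate(self, code):
        """Map a radar track's Mode A code to a callsign, or None if uncorrelated."""
        return self._callsign_by_code.get(code)


def format_code(code):
    """Render a code as the four octal digits shown to the controller."""
    return f"{code:04o}"
```

In this sketch an uncorrelated track simply returns None, which a display could render as a track without a callsign label.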

A flight progress strip is a tool that controllers use to record the progress of each flight

as it moves through the sector. It represents a record of all ATC instructions given to

each aircraft. It is also used as a backup to the surveillance function in the event of a

failure. The flight strip printer facility, as an additional component in this functional

block, supports the printing of flight strips at the executive, planner, and/or flight data

assistant positions, depending on the suite configuration. This facility automates the

previous manual filling of a flight strip through access to a database of flight information

and a printout of the data when needed. The printed strip displays the non-dynamic

aspects of the flight, necessitating only tactical dynamic instructions to be manually

entered on the strip by the controller.

The RDPS processes radar pictures from all available sources (primary and secondary,

short range and long range, en-route and approach radars) to establish an accurate

picture of all traffic over a well-defined geographical area. In the case of multiple radar

coverage, the RDPS provides a composite air picture of the traffic while taking into

account radar biases for range and azimuth measurements (EUROCONTROL, 2003a).

The ATM surveillance tracker and server system (ARTAS) processes PSR, SSR, Mode

S, and ADS data. These highly accurate and reliable data are directly integrated into

the existing ATC environment by using a universal data exchange format. For example,

EUROCONTROL defined the All Purpose STructured Eurocontrol Radar Information

Exchange (ASTERIX) messaging format. This allows the transfer of information

between two parties (e.g. systems) using a mutually agreed format of data.



The data processing and distribution functional block also incorporates both a fallback

flight data processing system and fallback radar data processing system, as necessary

redundant systems in every ATC Centre. These fallback systems may provide identical

functionality if they are duplicates of the FDPS and RDPS systems. However, in some

cases these fallback systems do not necessarily provide the same range of functions

as the main systems. The necessity of redundant systems in ATC is discussed further

in Chapter 4.

2.3.1.5 Supporting function

The supporting function comprises various ATC tools that enable integrated air traffic

management operations that enhance safety and increase airspace capacity. The main

objective of these tools is to lessen the cognitive workload on the controller while

focusing on the relevant (task specific) information (IFATCA, 2004). They also assist in

the detection and resolution of potential problems. It is important to note that these

tools do not replace the need for controller decision making processes; they simply aid

them. The supporting function includes the following tools (Figure 2-8):

• Monitoring tools assist with detection and recording of any safety-related events

(e.g. the Automatic Safety Monitoring Tool – ASMT), reduce the workload

associated with traffic monitoring tasks by identifying the potential and actual

deviations or non-conformance with the planned flight trajectory (e.g. MONitoring

Aid – MONA), and automatically check if aircraft are adhering to their planned

route (e.g. Route Adherence Monitoring – RAM) or cleared flight level (e.g.

Cleared Level Adherence Monitoring – CLAM) by comparing ‘planned’ or

‘cleared’ information with the aircraft actual position (EUROCONTROL, 2001f);

• The Medium Term Conflict Detection (MTCD) system is a tool which enables

controllers to predict and identify future conflict between aircraft in the predefined

region by applying separation rules (EUROCONTROL, 2001f); and

• Sequencing managers (e.g. Arrival Manager - AMAN, Departure Manager -

DMAN, Means to Aid Expedition and Sequencing of Traffic with Research and

Optimisation - MAESTRO) are decision making tools for providing the approach

and en-route controllers with the control and sequencing actions to properly

expedite traffic to the destination airports and runways (EUROCONTROL, 2001f).
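The comparison of 'planned' or 'cleared' information with the actual aircraft position, as performed by RAM and CLAM, can be sketched as follows. This is an illustration only: the function names, the flat-earth geometry, and the tolerance values (5 NM laterally, 300 ft vertically) are assumptions made for the example and are not taken from EUROCONTROL specifications.

```python
import math


def cross_track_nm(pos, seg_start, seg_end):
    """Distance from a position to a route segment (flat-earth sketch, coordinates in NM)."""
    (px, py), (ax, ay), (bx, by) = pos, seg_start, seg_end
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project the position onto the segment and clamp to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def ram_alert(pos, seg_start, seg_end, max_deviation_nm=5.0):
    """Route adherence: True when the lateral deviation exceeds the assumed tolerance."""
    return cross_track_nm(pos, seg_start, seg_end) > max_deviation_nm


def clam_alert(cleared_level_ft, actual_level_ft, tolerance_ft=300.0):
    """Cleared level adherence: True when the level deviation exceeds the assumed tolerance."""
    return abs(actual_level_ft - cleared_level_ft) > tolerance_ft
```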



Figure 2-8 Supporting function

These tools aim to enhance the controller’s appreciation of the current and predicted

traffic situation and facilitate the decision making process. They are an integral part of

the HMI (i.e. radar display) and are informed by the output of the data processing and

distribution function.

2.3.1.6 Safety Nets

A safety net (SNET) is an airborne and/or ground-based function alerting the pilot or

controller to the imminent possibility of collision between aircraft, between aircraft and

terrain/obstacles, as well as penetration of dangerous airspace (IFATCA, 2004). The

most common safety nets are Short Term Conflict Alert (STCA), Minimum Safe

Altitude Warnings (MSAW), Area Proximity Warnings (APW), and Runway Incursion

Monitoring and Conflict Alert System (RIMCAS).

The previous section described medium term conflict detection (MTCD) as an ATC tool

which assists the controllers in early detection and prediction of conflicts (e.g. 20

minutes in advance). Similarly, the STCA function detects two system tracks predicted

to be in conflict (i.e. two tracks where both horizontal and vertical separations are about

to be compromised). This system then alerts the controller to the imminence of a

separation minima infringement through the display of visual alarms presented on the

affected traffic on the HMI. However, whilst MTCD is for early detection and prediction

of conflicts, the STCA is used as a safety net or defence against imminent conflict

(EUROCONTROL, 2007a). The exact moment of STCA alarm depends upon



predetermined settings (usually it is set to trigger the alert between 90 seconds and two

minutes prior to conflict).
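The STCA logic described above can be sketched as a straight-line extrapolation of two system tracks over a short look-ahead time. The sketch is illustrative only: the Track fields, the sampling step, and the separation minima (5 NM and 1000 ft) are assumptions for the example, and operational STCA implementations use more elaborate prediction and filtering.

```python
import math
from dataclasses import dataclass


@dataclass
class Track:
    """Simplified system track (hypothetical fields, flat-earth coordinates)."""
    x_nm: float
    y_nm: float
    alt_ft: float
    vx_kt: float
    vy_kt: float
    climb_fpm: float = 0.0


def predict(track, t_s):
    """Linearly extrapolate a track t_s seconds ahead."""
    hours = t_s / 3600.0
    return (
        track.x_nm + track.vx_kt * hours,
        track.y_nm + track.vy_kt * hours,
        track.alt_ft + track.climb_fpm * t_s / 60.0,
    )


def stca_alert(a, b, look_ahead_s=120, step_s=5, h_min_nm=5.0, v_min_ft=1000.0):
    """Return the first sampled time (s) at which BOTH horizontal and vertical
    separation are predicted to be lost, or None if no conflict is predicted."""
    for t in range(0, look_ahead_s + 1, step_s):
        ax, ay, az = predict(a, t)
        bx, by, bz = predict(b, t)
        if math.hypot(ax - bx, ay - by) < h_min_nm and abs(az - bz) < v_min_ft:
            return t
    return None
```

The look-ahead default of 120 seconds mirrors the upper end of the alert window quoted above; a conflict first predicted at the edge of that window would leave the controller roughly two minutes to react.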

The MSAW function enables detection of a radar track predicted to infringe the

minimum safe altitude above an obstacle. MSAW processing takes into account the

track altitude (i.e. altitude of the track extracted from Mode C or present altitude

corrected for pressure at mean sea level known as QNH pressure, thus providing the

altitude above mean sea level), attitude indicator (i.e. climb or descent), position and

speed vector. In addition, the system will detect if a radar track is predicted to deviate

from the approach path of an airport (EUROCONTROL, 2007a).

The APW is used to designate areas which are dangerous for an aircraft to enter (e.g.

missile firing, military training, and air display areas). These areas can be identified as:

prohibited, restricted, dangerous, military training, segregated, special use, temporary

restricted, and permanently restricted. The APW ensures that any aircraft infringing or

predicted to infringe on one of these areas is detected by this system and an advance

warning is presented to the controllers (EUROCONTROL, 2007a).
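The distinction between an aircraft infringing an area and one predicted to infringe it can be sketched with a point-in-polygon test applied to the current and the extrapolated position. The function names, the status strings, the flat-earth coordinates, and the 120 second look-ahead are assumptions made for illustration; vertical limits of the area are ignored in this sketch.

```python
def inside(point, polygon):
    """Ray-casting test: True when the point lies inside the polygon."""
    x, y = point
    result = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edges crossed by a horizontal ray running to the right of the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                result = not result
    return result


def apw_status(pos_nm, velocity_kt, area, look_ahead_s=120, step_s=10):
    """Return 'infringing', 'predicted', or 'clear' for one track and one area."""
    if inside(pos_nm, area):
        return "infringing"
    for t in range(step_s, look_ahead_s + 1, step_s):
        hours = t / 3600.0
        future = (pos_nm[0] + velocity_kt[0] * hours, pos_nm[1] + velocity_kt[1] * hours)
        if inside(future, area):
            return "predicted"
    return "clear"
```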

RIMCAS is an airport monitoring and conflict alert system which detects and alerts

controllers before a runway incursion is about to occur. The system gives the controller

an opportunity to react within a realistic and effective timeframe. This system is also

known as the ground short term conflict alert system. The main requirement of this

system is to be supplied with reliable surveillance data as any false alert unnecessarily

increases controller workload. As a result, the Automatic Dependent Surveillance

Broadcast (ADS-B) system should enhance surveillance capability for airport

monitoring and conflict prevention through the Advanced Surface Movement Guidance

and Control Systems (A-SMGCS) (ICAO, 2005).

2.3.1.7 Power supply

The availability of electrical power is a prerequisite in a computer driven environment,

such as an ATC Centre. Electrical power is obtained from public utilities, but in case of

interruptions or non-availability, the ATC Centre's own installations are required to

provide electrical power. This is most commonly achieved by diesel-powered

generators or powerful batteries, supporting an Uninterruptible Power Supply (UPS)

capability. These components are required to provide uninterrupted electrical power

supply in order to prevent computers shutting down.



2.3.1.8 Pointing and input devices

The Human Machine Interface (HMI) represents the entire ATC system to the controller

on each Controller Working Position (CWP). In order to interact with available systems,

the controller uses input and pointing devices. Input devices include Touch Input

Panels (TIP), the mouse, and keyboard. However, the most frequent pointing devices

are the mouse and trackerball. Using the input and pointing devices, the controller

‘communicates’ with the entire ATC system, and edits and reads ‘live’ flight plans. All

the changes and interactions made by controllers via input and pointing devices are

presented on displays (i.e. radar, auxiliary display, and communication panel).

2.3.1.9 System control and monitoring function

This function is supported by a computer and monitor system that controls the overall

ATC system from a centralised position, i.e. the system control and monitoring unit.

The main purpose of this system is to display the actual state of the core systems and

subsystems within the CNS/ATM infrastructure, to manage incidents, and to perform

the reconfiguration of resources within its infrastructure. This functional block

constantly checks the functionality of the overall system, involving the software and

hardware configuration in order to ensure high system availability (EUROCONTROL,

2003a). The system monitoring and control functionality is supported by several

different facilities which are explained in the following paragraphs (Figure 2-9).

Figure 2-9 System monitoring and control function

The data recording and playback facility enables automatic recording of all transactions

made by the radar data, flight data, radar display, and communication functions. This

includes all controllers’ modifications to flight plans, received messages, and display

setting modifications (EUROCONTROL, 2003a). The recorded data are used for further

data analysis and for playback of the specific air traffic situation (i.e. in the case of an



incident). The recordings are stored on disks for the time deemed necessary by the

relevant aviation authority (the legal requirement is 30 days but could be longer if

necessary for incident investigation).

One of the most requested system control and monitoring functions is the ability to

detect faults in the supervised ATC system by continuous control and monitoring of the

system operation. This facility provides detailed information on the equipment states

within the managed systems and the relevant alarm conditions which may affect the

operating mode. It also logs events and enables the remote control of supervised

equipment and setting of the system thresholds (EUROCONTROL, 2003a). Its main

sub-functions are: fault management (i.e. alarm management, threshold setting),

configuration management (i.e. equipment descriptions), performance management

(i.e. identification of trends and problems), and security management (i.e.

authentication, identification, password protection, tailored user interface). The control

and monitoring is performed on all positions, external lines, and connections.

Each ATC system is designed to have several operational system modes

(EUROCONTROL, 2003a). These modes automatically switch in if any of the major

processing systems fail. The objective is that the controller always has some

functionality available despite the degradation of equipment. Reduced radar, alert, flight

plan, and communication modes are the most frequent types of reduced operational

modes available in current ATC systems.

The time management facility uses the external time received from the GPS signal for

synchronising time on all computers (i.e. all Controller Working Positions - CWPs). The

time is expressed in Coordinated Universal Time (UTC), also known as Zulu time.

Originally, it was a time scale based on the local standard time on the 0° longitude

meridian which runs through Greenwich, United Kingdom. Today, UTC uses precise

atomic clocks and satellites to ensure a reliable and accurate time standard for air and

ground operations (ICAO, 1979).

2.4 Characteristics of the generic Air Traffic Control Centre

The preceding paragraphs presented the architecture (functional and physical) of an

Air Traffic Control (ATC) system. However, a more complete understanding of the ATC

system (i.e. people, equipment, procedures) is possible within the context of an ATC

Centre providing specific types of services. Therefore, this section reviews the main

characteristics of a ‘generic’ ATC Centre with particular focus on current technologies.



The following section focuses on technologies that will determine the characteristics of

the generic ATC Centre in the future.

There are significant variations in equipment between ATC Centres, both in Europe

and worldwide. At the European level, EUROCONTROL, the European Organisation

for the Safety of Air Navigation, took on the role of promoting harmonisation, integration,

and standardisation while improving safety and overall performance of the ATM/ATC

systems in its member states. For example, EUROCONTROL (2006d) has considered

the costs of fragmentation of the European ATM system. At a global level, ICAO

standardisation activities are undertaken when new systems or technologies are

mature, have demonstrated their ability to provide safety enhancements compared to

existing systems, and are cost beneficial to international civil aviation (ICAO, 2003).

ICAO has established standards and recommended practices for all of its contracting

states (ICAO, 2006b).

In spite of the significant effort to date to standardise ATM/ATC within the aviation

community, there are still significant differences. For this reason, the methodology

adopted in this thesis for the assessment of controller recovery from equipment failures

in ATC is designed on the basis of a ‘generic’ ATC Centre. This is defined below.

The ATC Centre should be based on a fully automated and integrated system with a

fail-safe design based on duplicated processors and open architecture in accordance

with existing industrial standards. It also has to have graceful degradation modes. The

data processing functional block should be able to support acquisition and processing

of data from several radars (i.e. multiradar tracking), automatic collection and

processing of flight plans, automatic allocation of SSR codes, coordination achieved

through direct connection to adjacent centres (e.g. on-line data exchange - OLDI),

coordination of civil and military flights via a separate military suite, and automatic flight

progress monitoring (continuous calculation of flight profile and update based on radar

data). The air situational picture should be presented on the HMI (radar and auxiliary

display) with necessary alert facilities (e.g. STCA, MSAW, CLAM, RAM). The playback

function of radar pictures should be available for incident investigation, testing,

development, and training.

The ATC Centre should have the capability to have paper strip presentation on the strip

console. A flight progress strip is a single strip of paper that contains all information on

a flight and its evolution through a particular sector of airspace. It is used as a quick



way to record the progress of the flight and to keep a legal record of the instructions

issued. It is also used to allow the planning controller to predict future conflicts and to

ensure that sector entry/exit conditions are achieved. In addition, in the case of radar

failure, flight progress strips represent the primary control tool. The strip, mounted in a

strip holder, is placed with other strips in a 'strip board' which displays all flights in a

particular sector of airspace or at an airport.

In recent years, there have been initiatives aimed at electronic strip presentation, used

in many European ATC Centres and airports. However, as Lanzi and Marti (2001) point

out, controllers do not generally find electronic strips to have the same level of flexibility

and support as paper strips. On the other hand, more radical attempts have been made

toward a stripless environment, where aircraft information is tagged to the label on the

radar screen that can be expanded as necessary. In this environment generally three

modes of the same aircraft label exist: the standard label that is always displayed on

the screen, the highlighted label that is bigger and contains more information, and the

extended label that contains all information not immediately required by the controller

(for details see Lanzi and Marti, 2001).

The previous sections have discussed the current technologies relevant to an ATC

Centre. This forms a part of the definition of a ‘generic’ ATC Centre. In addition, the

generic ATC Centre should be adaptable to changes in technologies. Hence, the

following section addresses the future of ATC and how this is likely to impact on an

ATC Centre.

2.5 The future of Air Traffic Control

The research presented in this thesis has to take into account the future challenges

that may face controllers with the increased exposure to more automated systems. In

this regard, this section briefly discusses the key challenges of automation,

characteristics of human-centred design, as well as the concept behind the ICAO’s

Future Air Navigation Systems (FANS). The section concludes with a discussion of

the potential sources of technical and human performance deficiencies within the future

ATC Centres and their relevance to the equipment failures and the recovery process as

investigated in this thesis.

2.5.1 Challenges of automation

There are various definitions of automation, residing in different contexts. In the context

of Air Traffic Management (ATM), the National Research Council Panel on Human



Factors in Air Traffic Control Automation (Wickens et al., 1998) defined automation as:

“a device or system that accomplishes (partially or fully) a function that was previously

carried out (partially or fully) by a human operator.”

According to Wickens (1992) automation is mainly applied to perform or assist

functions in which humans are naturally limited (e.g. accessibility to toxic, dangerous,

unreachable environments; or inherent working memory limitation). In addition,

automation is used to replace humans in operations which are time consuming, costly,

or induce high workload (e.g. complex monitoring or analytical processes). While often

seen as replacing humans, in reality, automation changes the role of the human

operator from direct manual control to largely supervisory control. In other words, in this

new role, the human operator plans and inputs tasks and the computer systems

implement these tasks automatically. Automation does not totally replace human

activity, it just changes the nature of the work that humans do. This change is often

completely unintended or unexpected by automation designers (Parasuraman and

Riley, 1997).

Past research has identified three sources of human performance deficiencies when

using high level automation (Bainbridge, 1983; Wickens et al., 1998; Wiener and Curry,

1980; Boehm-Davis et al., 1983). Firstly, humans become less likely to detect failures in

the automation itself or in the automated process. Secondly, they lose some

awareness of the state of the automated process. Finally, human operators eventually

lose skills in performing the actions manually if these actions have been previously

automated. These three phenomena are commonly known in the literature as ‘out of the

loop’ performance problems. This problem of deterioration of manual skills is

particularly relevant to controllers and flight crews. As Bainbridge (1983) points out, an

irony is that the more reliable the automation, the more prone the operator will be to

‘out of the loop’ performance problems. This is the direct result of increased

complacency, over-trust in automation, and deterioration of manual skills of both

controllers and pilots.

Experiments have shown that operators’ abilities to recover from emergency

automation failure significantly improve with levels of automation that require human

involvement in the implementation of a task. Thus automation strategies that allow

operators to focus on current operations may contribute to improved situational

awareness and reduction in workload (Endsley, 1997). As a result, a new approach to



automation evolved resulting in human-centred designs instead of technology- or

automation-centred designs.

2.5.2 Human-centred vs. technology-centred automation

Traditionally, automation was perceived in an all-or-none fashion. At one extreme,

automation was employed completely and expected to eliminate human error. At the

other extreme, automation was kept to an absolute minimum, keeping the operator as

much as possible in the control loop. This traditional approach to automation has been

known as ‘static’, where the level of automated assistance was unchanged over time

(Parasuraman et al., 1990). However, decades of research showed that between these

two extremes, different levels of automation can be specified by the degree to which a

task is automated. This way of thinking led to a concept of human-centred automation

which is essentially developed around the idea to keep the operator in control of the

situation (Billings, 1996; Parasuraman et al., 1990; Sheridan, 1980). As Layton et al.

(1994) note, the design of any automated system should be seen as the design of a

new collaboration between the machine and the human operator.

According to Wickens et al. (1998) the choice of what to automate should be simply

guided by the need to compensate for human vulnerabilities and to exploit human

strengths. However, this simplistic approach may again lead to static automation, not

exploiting and adapting automation to the characteristics of the context (surrounding

the human operator). Therefore, it seems more reasonable to move beyond traditional

automation approaches toward the principles of dynamic allocation of control between

human and machine, i.e. ‘adaptive automation’ (Scerbo, 2005; Kaber, 1997; Kaber and

Riley, 1999; Parasuraman et al., 1996; Parasuraman et al., 2000; Kaber, Prinzel,

Wright, and Clamann, 2002).

In short, the presence of automation is inevitable in all future concepts of air navigation.

Current design initiatives are more focused on the human-centred automation while

initial steps have begun to be taken toward adaptive automation. For example, the

concept of cognitively convenient alarm onset has been tested on a US naval ship as

described in Daniels et al. (2002). Based on the previous discussion on the main

principles of automation, it is necessary to review how these principles are

implemented in the design of future ATC systems and tools. The following section

presents the key concepts that will characterise the Communications, Navigation, and

Surveillance/Air Traffic Management (CNS/ATM) system up to the year 2020.



2.5.3 The future of air navigation service

The problems with the current air traffic management system can be summarised in

two areas. Firstly, the fragmentation of national systems prevents optimal use of global

airspace, as aircraft have to be controlled by many different air traffic systems.

Secondly, inherent limitations of current Air Traffic Control (ATC) technologies and

operational procedures are well known and make it impossible to achieve enhanced

efficiency and required capacity for the future (Ochieng, 2006).

To respond to the identified areas of concern, the International Civil Aviation

Organisation (ICAO) developed the Future Air Navigation Systems (FANS) concept built

around Communications, Navigation, and Surveillance in Air Traffic Management

(CNS/ATM) system. As a result, future concepts and strategies in ATM/ATC will follow

a global approach to ATM and no longer focus solely on national needs. In this overall

environment, ATM/ATC technologies will face necessary changes and development

currently under conceptual or design phase. The general drivers of future ATM/ATC

are structured around communication, navigation, and surveillance functionalities and

are summarised below:

• communication in the 2020 time frame is expected to be characterised by a mix

of analogue voice and digital communication with increased use of datalink (VHF

based datalink-VDL, SSR Mode S datalink) and satellite communication

(SATCOM) to complement or replace existing analogue voice communications.

• navigation in the 2020 time frame is expected to be characterised by a mix of

ground- and satellite-based systems with increased use of satellite systems (e.g.

GPS, Galileo) for all phases of flight.

• surveillance in the 2020 time frame is expected to be characterised by a mix of

airborne (ADS, ADS-B, ADS-C, A-SMGCS, cockpit situational awareness-SAS)

and ground-based functions (SSR Mode S) with increased functionality

complementing or replacing the existing ground-based systems (PSR and SSR).

This succinct statement of the evolution of CNS/ATM within the 2020 time frame needs to

be further discussed from the perspective of a generic ATC Centre. In other words, it is

necessary to discuss the potential characteristics of the generic ATC Centre in 2020.

Based on ICAO and EUROCONTROL future concepts, the following changes are

expected in the generic ATC Centre in 2020:

• in support of Gate to Gate (G2G) flight management the following ATC systems

and tools are proposed for the period from 2010 onwards: four dimensional flight



trajectory prediction, sequencing managers (AMAN, DMAN), MTCD, monitoring

aid (MONA), system supported coordination (SYSCO);

• stripless environment;

• datalink communication;

• autonomous or free flight concept less reliant on ground-based navigational aids;

• transfer of separation responsibility to the flight deck giving controllers more of

a monitoring role;

• electronic (silent) coordination; and

• dynamic optimisation of airspace through the Single European Sky (SES)

initiative (EUROCONTROL, 2007b) and the concept of flexible use of airspace

(see MANTAS concept; EUROCONTROL, 2004b).

After presenting the system design principles and characteristics of future ATM/ATC, it

is important to discuss the impact that those changes may have on equipment and

human reliability. Following the main objective of the research presented in this thesis,

it is necessary to identify the potential sources of technical and human performance

deficiencies and their relevance to the controller recovery process.

2.5.4 Impact of future ATM/ATC on controller recovery from equipment failures

With the accumulated knowledge of the modern integrated ATC systems, it is

reasonable to assume that future overall equipment reliability will remain similar to

current standards. However, the nature and types of equipment failure may change.

While eliminating single points of failure, future ATC Centres may experience increased

problems with software reliability and data integrity (e.g. presentation of inaccurate

data). This will be the direct result of a more complex and integrated ATC architecture

as well as incompatibility between current and future, more automated ATC equipment.

In other words, the future ATC Centres may be faced with failure types that will be

harder to detect and repair. The highly integrated ATC architecture may mask some of

these failures and hide the real cause(s) of the problem.

When discussing human reliability issues in future ATM/ATC environment, it is

reasonable to assume that automation design will create situations where controllers

will not be able to cope with its complexity or simply will not have enough time

available. This is a direct result of the assumed ‘out of the loop’ performance and the

reduced separation between aircraft (as a requirement to provide necessary capacity).

As noted by Wickens et al. (1998), the time available to safely respond to an



emergency situation will decrease with decreased separation, while the operator

response time may increase due to ‘out of the loop’ performance. One alleviating factor

may be the transfer of responsibility for separation management from controllers to

pilots, giving the former more time to effect recovery. The environment of collaborative

decision-making and real-time information exchange, however, threatens to distribute

false or inaccurate information from the ground to the air. In this case, ATC equipment

failure may affect the airborne segment of ATM and cockpit instruments (e.g. Flight

Management System - FMS).

The European Organisation for the Safety of Air Navigation (EUROCONTROL) recognised

that the role and nature of controller tasks will change as a result of the addition of

increased automation within the ATM system. As a result, they initiated the Solutions

for Human-Automation Partnerships in European ATM (SHAPE) project to better

understand interactions between automated support and controllers (EUROCONTROL,

2004f). SHAPE has identified seven factors that need to be addressed to ensure

harmonisation between automated support and the controller. Amongst factors such as

trust, situational awareness, team issues, skills, ageing, and workload, SHAPE

recognised the importance of managing system disturbances (details are presented in

Chapters 5 and 7). As a result, the assessment of controller recovery presented in the

remainder of this thesis considers the interactions between human and automation. A

flexible approach has been developed to assess controller recovery in any possible

context.

In short, the role of the human operator will remain significant in the future ATC

environment. Due to the transfer of responsibility for separation management from

controllers to pilots, recovery performance will evolve from the controller’s actions alone

to collaboration between controller and pilot. To support human performance in

the future, more automated environment (both on the ground and in the cockpit), special

attention will have to be given to the areas of human-computer interaction, training, and

procedures for both normal and abnormal situations.

2.6 Summary

The aim of this Chapter is to create a basis for the research on recovery from

equipment failures in ATC. There are several findings that will be taken forward from

this Chapter. Firstly, this Chapter defined ATM and its component ATC and thus

indicated the scope of the research presented in this thesis. Secondly, this Chapter

placed additional emphasis on the ATC functional classification. This classification



starts with the main ATC functional blocks further dissected to element level. It has

been defined based on both current and future ATC systems and tools in accordance

with principles and initiatives of ICAO and EUROCONTROL. As such, this ATC

functional breakdown is flexible to changes in ATM/ATC and should capture both

current and future equipment failure types. Finally, this Chapter defined characteristics

of a ‘generic’ ATC Centre in both current and future ATC environments. This finding

creates a base for the entire research presented in this thesis.

The next Chapter focuses more on the equipment component of the ATC system.

Since the aim of the overall thesis is to assess the impact of equipment failures, the

next Chapters provide relevant definitions, identify types of equipment failure, and assess their

contribution to the safety of the overall air transport system. A sample of operational

failure reports used in this research is validated through a framework based on the

contribution of equipment failures to the overall safety of the air transport system.

Chapter 3 Preliminary Assessment


3 Preliminary Assessment of Equipment Failures in Air Traffic Control

The previous Chapter presented the context of the research in this thesis by describing

the Air Traffic Management (ATM) system and its component the Air Traffic Control

(ATC) system. Furthermore, it detailed the range of functions provided in an ATC

Centre. The main characteristics of current ATC Centres as well as the concepts

shaping their future characteristics were also covered. A comprehensive analysis of

equipment failure should follow its ‘life’ by assessing all the phases that this occurrence

undergoes throughout the ATC system (Figure 3-1). An equipment failure firstly

encounters the existing technical built-in defences. If these inherent defences are

insufficient to prevent the failure impacting on the ATC system, the failure now

becomes a hazard. Hazards represent a sub-group of equipment failures that penetrate

existing technical built-in defences and hence require human intervention (or human

recovery). An equipment failure occurrence concludes with the outcome which is the

result of the collaboration between technical and human recovery.

Figure 3-1 Phases of an equipment failure occurrence

Following the equipment failure ‘life’, the Chapter starts with the relevant definitions of

equipment failures and hazards. While the human recovery and outcome phases of the

equipment failure ‘life’ are discussed in the remainder of the thesis, this Chapter

continues by presenting the available sample of operational failure reports. It also

discusses the reporting schemes used to obtain equipment failure reports and data

pre-processing issues. The appropriateness of this sample is assessed by using a



methodology that determines how much ATC equipment contributes to the safety of the

overall air transport system. Agreement between the findings obtained from past

research and the analysis of available operational failure reports indicates the validity

of this sample. Once this is achieved, the thesis continues with a more in-depth

assessment of the available sample in the following Chapter.

3.1 Definition of equipment failure

The focus of aviation safety and reliability management has mainly been on the

prevention of technical failures, human failures (also known as human errors), and

more recently organisational or management failures (Reason, 1997). The European

Organisation for the Safety of Air Navigation (EUROCONTROL) defines failures in the ATC

system as “the inability of any element of that system to perform its intended function or

to perform it correctly within specified limits” (EUROCONTROL, 2002c). As discussed

in Chapter 2, the ATC system comprises people, equipment, and procedures

integrated in an optimal way to achieve a common objective. However, the research

presented in this thesis focuses solely on failures of one component of the ATC system,

namely equipment. Therefore, in the following text, the term ‘failure’ will only apply to

equipment failures or malfunctions.

Leveson (1995) defines failure as the “inability of the system or component to perform

its intended function for a specified time under specified environmental conditions”. The

definitions by Leveson and EUROCONTROL are similar as both take into account

failure in a much wider sense. In this research a failure occurs when any component of

ATC equipment terminates unexpectedly and no longer performs the required function,

while the overall ATC system remains operational. If the entire ATC system becomes

unavailable, the failure is known as an outage. For example, communication failure is

observable in an ATC Centre if there is unexpected failure of radio communication

equipment on one console. However, if the failure affects the entire ATC Centre (e.g.

due to loss of power), this failure is known as an outage. It is important not to restrict

the term failure only to catastrophic events. Small-scale failures can combine to act

more severely in different environmental conditions (contexts). According to Wickens et

al. (1998) the source of such problems could be software bugs, erroneous or delayed

data exchange, or design deficiencies. Figure 3-2 illustrates the definitions discussed

previously.



[Figure: the ATC system comprises people, equipment, and procedures and training; the corresponding failures are human failure (human error), equipment failure, and failure of procedure and/or training. An equipment failure has local impact (console/sector); an outage or fallback has overall impact (entire ATC Centre). The failure mode is the failure effect observable on equipment and/or the ATC system.]

Figure 3-2 Different definitions

In a similar way, it is necessary to differentiate between total and partial equipment

failures. Using the example above, a total radio communications failure will result in a

situation where a controller working position (or a sector) can no longer provide air

traffic services due to the inability to communicate clearances or instructions to aircraft.

However, if a failure affects only one element, either the transmitter or receiver, and the

other component is still operational on that position (or the sector), the radio

communication failure will be regarded as partial. In other words, if the equipment no

longer performs any aspect of the required function the failure is total, but if at least

some portion of the required functionality still exists, the failure is only partial.
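
The distinctions drawn so far (failure versus outage, total versus partial) can be sketched as a small classifier. This is an illustrative sketch only: the function names, the representation of equipment functions as sets of strings, and the returned labels are assumptions made here and are not part of the research method.

```python
def classify_extent(required, working):
    """Classify a failure as 'none', 'partial', or 'total'.

    required: set of functions the equipment must provide (hypothetical labels)
    working:  set of functions still available after the event
    """
    remaining = set(required) & set(working)
    if remaining == set(required):
        return "none"
    # Some required functionality left -> partial; nothing left -> total.
    return "partial" if remaining else "total"


def classify_scope(entire_centre_affected):
    """Label the event an 'outage' if the entire ATC Centre is affected."""
    return "outage" if entire_centre_affected else "failure"


# Radio example from the text: transmitter lost, receiver still operational.
radio = {"transmit", "receive"}
print(classify_extent(radio, {"receive"}))  # partial
print(classify_extent(radio, set()))        # total
print(classify_scope(False))                # failure (one console or sector)
print(classify_scope(True))                 # outage (e.g. loss of power to the Centre)
```

Representing the required functions as a set keeps the total/partial test independent of the equipment type, which matches the general wording of the definition.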

All technical items are designed to fulfil one or more functions. A failure mode is thus

defined as an inability to partially or completely fulfil one of these functions (Figure 3-2).

It is also defined as the visible effect of a failure on the ATC system. Note that

equipment failures may not have any visible impact on the ATC service due to the

availability and effectiveness of built-in defences (e.g. redundancy) discussed in more

detail in Chapter 4. In this case, the only visible effect on the system (i.e. failure mode)

would be the engagement of the first level of redundancy. In some cases, this transition

is done seamlessly and it is only apparent to technical staff, but not to controllers. The



UK national air navigation service provider (NATS) differentiates between fallback and

failure modes. According to NATS, fallback mode is a condition which occurs only if

there is a major failure or when the level of redundancy is significantly eroded (NATS,

2002). Thus, the NATS definition of fallback modes corresponds closely to outages

defined previously.

It is very important to distinguish between equipment failures and human operator

failures, known as human errors (Figure 3-2). Note that it could be said that all failures

are human in their nature, since most of them involve humans at some stage of the

process, e.g. system designers might fail to anticipate a certain equipment state.

Humans are also involved in manufacturing, testing, validation, certification, and

maintenance. Any of these human operators can be directly or indirectly responsible for

a failure occurring in ATC. It is also important to note that non-technical failures should

not be directly considered as human failures. Frequently, a failure that has no obvious

technical cause is directly attributed to the human, due to a lack of a deep and

objective analysis of its causes and dynamic relations between technical and human

components of the system (Straeter, 2001).

The following sections start with the definition of a hazard, as a sub-group of equipment

failures that penetrate existing technical built-in defences and hence require human

intervention, which is the focus of the research presented in this thesis. This is followed

by the presentation of the sample of operational failure reports available in this thesis.

3.2 Definition of a hazard

The research in this thesis focuses on failures that penetrate technical defences (i.e.

technical recovery) and therefore impact (with different levels of severity) on a

controller’s performance. In this thesis, a hazard is defined as the ATC system state

resulting from an equipment failure that penetrates all existing technical defences and

affects the ability of the controller to perform his/her tasks. In different contexts a

hazard may have different definitions. For example, EUROCONTROL (2002c) defines

a hazard as “any condition, event or circumstance, which could induce an accident or

incident”. This EUROCONTROL definition is too broad and thus not in line with the

scope of this research. Thus, the term hazard in this research takes into account only

failures that require controller intervention (i.e. human recovery). The failures that

belong to this category are addressed in this thesis.



The following examples may help to clarify the difference between failure, hazard,

technical and human recovery, as defined in this research:

• A blocked radio frequency (failure) prevents exchange of information between a

controller and pilot. This failure presents a hazardous situation and requires the

controller’s immediate action (human recovery). Changing the frequency on the

same working position or moving to another available working position are

possible ways to recover.

• A power loss (failure) affects one set of Controller Working Positions (CWP).

Due to the independent Uninterruptible Power Supplies (UPS), electrical energy

is continuously provided and the controller does not notice this failure (no

hazard). The automatic changeover to UPS represents one example of built-in

technical defence or technical recovery (see Chapter 4 for detailed

explanation). If the continuous supply of electrical energy is not provided,

several CWPs may experience a problem, creating a hazardous situation and

requiring controller intervention (human recovery).
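
The two examples above can be expressed as a minimal sketch of the hazard definition used in this research. The class, its field names, and the way the blocked-frequency case is encoded (no effective technical defence) are assumptions introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass
class FailureEvent:
    description: str
    defence_available: bool  # is there a built-in technical defence (e.g. UPS)?
    defence_effective: bool  # did that defence actually contain the failure?


def is_hazard(event):
    """A failure becomes a hazard when it penetrates the technical defences."""
    return not (event.defence_available and event.defence_effective)


def recovery_type(event):
    """Hazards require human recovery; contained failures do not."""
    return "human recovery" if is_hazard(event) else "technical recovery"


# Hypothetical encodings of the examples in the text.
blocked_frequency = FailureEvent("blocked radio frequency", False, False)
power_loss_ups_ok = FailureEvent("power loss, UPS takes over", True, True)
power_loss_no_supply = FailureEvent("power loss, supply not maintained", True, False)

for event in (blocked_frequency, power_loss_ups_ok, power_loss_no_supply):
    print(event.description, "->", recovery_type(event))
```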

It should be pointed out that although this research considers only failures which lead

to hazardous situations, there are other failures as well. These other failures represent

the majority which never affect the controllers’ performance due to the effectiveness of

technical built-in defences (NATS, 2002). However, these failures still require

intervention, repair, and maintenance by engineers from the ATC system control and

monitoring unit.

After defining a failure and hazard as used in this research, the next section analyses

the nature of equipment failures in the operational environment. Details on this sample

of equipment failure reports are presented in the following section.

3.3 Supporting data: operational failure reports

Operational experience in this research is captured through a sample of operational

failure reports. They originate from four de-identified countries, referred to as Country

A, B, C, and D due to confidentiality. The following discussion focuses firstly on the

process of reporting equipment failures and their collection at the local level (i.e.

database of the ATC Centre) and national level (database of the respective Civil

Aviation Authority-CAA). The discussion continues by revealing a range of data pre-

processing problems and the corresponding solutions.



3.3.1 Reporting and data collection

The aim of occurrence data collection is generally to record the safety performance of

the relevant unit (e.g. ATC Centre). The data are collected on a range of safety-

relevant occurrences, such as incidents, losses of separation, equipment failures, bird

strikes, runway incursions, level busts, and others. For example, at the European level,

the EUROCONTROL ESARR 2 document (EUROCONTROL, 2000c) provides

recommendations on the reporting and assessment of safety occurrences in ATM. As a

result, the national Civil Aviation Authorities (CAAs) specify the types of ATM

occurrences to be collected, analysed, or investigated through their mandatory

occurrence reporting (MOR) schemes (Figure 3-3). For example, the UK CAA also

specifies who can report an occurrence, what the correct reporting procedure is, and

how the details should be disseminated (in the case of the investigation). The UK CAA

states that the objective of this reporting scheme is “to contribute to the improvement of

air safety by ensuring that relevant information on safety is reported, collected, stored,

protected, and disseminated. The sole objective of occurrence reporting is the

prevention of accidents and incidents and not to attribute blame or liability” (UK CAA,

2005).

Figure 3-3 Reporting system

In aviation generally, as in ATC, data is usually stored and sorted electronically in

different databases. Collection of data in hardcopy has long been abandoned in most

of the developed countries worldwide. The type and level of database detail depends

on the unit/group/authority collecting the data (e.g. a system control and monitoring

unit, air navigation service provider, or national CAA). For example, when collecting

equipment failure occurrences, the most detailed information is available in the



database of the control and monitoring unit within the particular ATC Centre. This

database must contain information on all equipment failures that occurred in the ATC

Centre regardless of their impact or severity. This is because

engineering staff need complete insight into all equipment failures, as they are

responsible for repair and maintenance.

However, not all equipment failures are required to be reported at a national level. The

choice of those that need to reach respective CAAs is made through a review of

reported incidents or safety events on a monthly, quarterly, and annual basis. As a

result, a national database will contain only occurrences of appropriate severity

characteristics and impact on operations. As an example, the UK CAA uses a MOR

database which contains, amongst others, reports on equipment failures that impact on

the controllers’ ability to provide air traffic services. These reports are fed in from the

Engineering Reporting Occurrence Database which contains details on all technical

problems, failures, and maintenance issues, of which the majority pass unnoticed by

controllers (due to the high level of ATC systems redundancy).

Collected data is regularly analysed to assess the safety performance at national level

as well as at the level of the relevant units (e.g. ATC Centre). Furthermore, this

information is sometimes used on a wider basis for benchmarking studies and to record

the safety performance of a given region (e.g. European Civil Aviation Conference –

ECAC consisting of 41 European countries).

3.3.2 Data pre-processing problems

As previously mentioned, the research presented in this thesis uses operational failure

reports from four operational databases. Problems experienced with extracting failures

from different operational databases can be summarised as follows:

• Different reporting schemes produce different levels of reporting detail. The amount

and quality of information reported differ significantly from one report to another.

Therefore, inconsistencies between reports were identified in terms of failure impact

(i.e. severity), duration, and location.

• There are differences in the terminology used (e.g. Computerised Automatic Terminal

Information Service - CATIS for Automatic Terminal Information Service - ATIS;

“hotline” for ground-to-ground communication, usually intercom; National

Aeronautical Information Processing System - NAIPS for Aeronautical Information

Service - AIS) and in the use of very specific component names (e.g. Air Ground

Data Processor - AGDP, as part of the datalink system).



• A lack of reporting culture that results in uncertainty related to data reliability and

completeness.

These problems are addressed below highlighting the approaches adopted to mitigate

them.

All reports have a short, one sentence long, summary followed by a description of the

equipment failure incident plus some additional information (e.g., date, occurrence

number, location, area code: flight information region or sector name). Unfortunately,

the additional information was not always available. Additionally, Countries C and D

provided their internal severity categorisation, while Country D also provided information on

failure duration. Since Country D’s dataset originates from an engineering unit, the

duration variable was measured from the first log of the failure until its final resolution.

As a result, it was possible to consistently extract four types of information. The type of

equipment/ATC functionality affected and complexity of failure type are extracted

usually from the short summary available for each report. The severity of equipment

failure is extracted using the available severity rating (if it existed) or assessing the

available information of the operational and safety impact of equipment failure and thus

applying the severity rating derived in this research (see Chapter 4, Table 4-5). Finally,

the duration variable is available only in the Country D database.
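
The four types of information extracted from each report can be pictured as one record per failure. The sketch below is only an illustration: the class name, field names, example values, and severity labels are hypothetical (the actual severity scale is the one in Chapter 4, Table 4-5), and the helper simply mirrors the rule of using a provider's severity rating when one exists and a derived rating otherwise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureRecord:
    country: str                      # "A", "B", "C", or "D"
    equipment_type: str               # equipment/ATC functionality affected
    complexity: str                   # complexity of the failure type
    severity: str                     # reported or derived severity rating
    duration: Optional[float] = None  # units as logged; Country D only


def choose_severity(reported, derived):
    """Prefer the provider's own rating; otherwise use the derived rating."""
    return reported if reported is not None else derived


# Hypothetical record for a country that supplied no severity rating.
record = FailureRecord(
    country="A",
    equipment_type="communication",
    complexity="single",
    severity=choose_severity(None, "minor"),
)
print(record)
```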

Data pre-processing is based on the classification of ATC system functionalities (see

Chapter 2). In certain reports it was very difficult to determine the type of equipment.

This problem was compounded by having only an acronym to explain precisely what

the report referred to. Consequently, several interviews were conducted with

engineering staff from two European ATC Centres to identify those ambiguous

items correctly and assure their proper classification. A glossary of terms and

acronyms was found to be a very useful tool during the pre-processing stage. Such

documents should accompany (or be an integral part of) every database as part of a

normal reporting practice.
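
A glossary of this kind can be as simple as a lookup table applied before classification. The following is a minimal sketch, assuming a hand-built mapping based on the terminology differences listed earlier; the dictionary, the function name, and the convention of returning `None` for unknown terms (to flag them for review by engineering staff) are illustrative assumptions.

```python
# Local term -> common term, taken from the examples quoted in this section.
GLOSSARY = {
    "CATIS": "ATIS",
    "NAIPS": "AIS",
    "HOTLINE": "ground-to-ground communication",
    "AGDP": "datalink system",
}
CANONICAL = set(GLOSSARY.values())


def normalise_term(term):
    """Return the common term, or None if the term needs expert review."""
    cleaned = term.strip()
    key = cleaned.upper()
    if key in GLOSSARY:
        return GLOSSARY[key]
    if cleaned in CANONICAL:
        return cleaned  # already a common term
    return None         # ambiguous: refer to engineering staff


for raw in ("catis", " hotline ", "AIS", "XYZ"):
    print(repr(raw), "->", normalise_term(raw))
```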

Within one country, the number of reports may not reflect the actual number of

equipment failure incidents in the ATC Centres for a variety of reasons. Firstly, there

may be a lack of reporting as a result of an inadequate reporting culture in

the ATC Centre and the aviation community overall. Secondly, not all equipment failures

are included in the CAA databases. As previously explained, only failures of certain



severity (i.e. impact on ATC operations and controller performance) tend to be reported

to the CAA. As a result, the available operational failure reports are neither necessarily

complete nor reliable (i.e. they lack the detail on the context surrounding a reported

occurrence). To date, no measure of completeness and reliability of occurrence

databases has been produced. This is a task for future research.

3.3.3 Available operational failure reports

As stated previously, there are four sources of data on equipment failures included in

this thesis, Countries A, B, C, and D. The first three data sets are from Civil Aviation

Authority (CAA) databases for a given time period. In other words, these are equipment

failures reported in the CAA database for all ATC Centres within the national

boundaries of these countries over a given time period (usually a year). The fourth data

source (Country D) represents data from the system control and monitoring unit of one

ATC Centre. Table 3-1 gives a summary of the available data.

Table 3-1 Summary of available data, number of reports, and equipment failure incidents per country

| Country | Source of data | Time period available | Average flight hours flown for available time period | Total number of reports pre-processed | Total number of equipment failures reported |
| A | CAA | 1999-2003 | 1,375,800.00 | 1,378 | 791 |
| B | CAA | 2001-2005 | 1,027,870.00 | 1,393 | 1,324 |
| C | CAA | 1992-2004 | 389,245.68 | 3,340 | 448 |
| D | System control unit/ATC Centre | 08/2000-2004 | 428,502.22 | 16,697 | 7,788 |
| Total | | | | 22,808 | 10,351 |

After pre-processing of all available equipment failure reports (22,808), more than ten

thousand reports (i.e. 10,351) are identified as equipment failures in air traffic control

(Table 3-1). The remaining reports mainly comprised equipment-related reports

outside of the national airspace, multiple reports filed for the same occurrence to reflect

multiple findings or causes identified, as well as reports on non-ATC equipment and

other non-technical types of incidents (e.g. human error, runway closures due to non-

equipment issues, scheduled maintenance, software updates, and scheduled hardware

changes).
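
The totals in Table 3-1 can be checked with a few lines of arithmetic. The sketch below uses only the report and failure counts from the table; the variable names are illustrative, and the two percentages it prints (share of each country's reports retained as equipment failures, and each country's share of all identified failures) are derived here for orientation rather than quoted from the thesis.

```python
# (reports pre-processed, equipment failures identified) per country, Table 3-1
TABLE_3_1 = {
    "A": (1378, 791),
    "B": (1393, 1324),
    "C": (3340, 448),
    "D": (16697, 7788),
}

total_reports = sum(reports for reports, _ in TABLE_3_1.values())
total_failures = sum(failures for _, failures in TABLE_3_1.values())

for country, (reports, failures) in TABLE_3_1.items():
    retained = 100 * failures / reports        # % of this country's reports kept
    share = 100 * failures / total_failures    # % of all identified failures
    print(f"Country {country}: {retained:.1f}% of reports retained, "
          f"{share:.1f}% of all identified failures")

print(total_reports, total_failures)  # 22808 10351
```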



The time period studied, for countries A and B, could be considered steady (uniform)

with respect to the ATC service provided and other aviation related factors (e.g. traffic

levels, jet fuel prices, airline fares, regulations). However, one modern ATC Centre was

opened in Country A in the second half of 2001. This resulted in a relatively large number

of failures of individual components early in 2002. This is a recognised

characteristic of the initial life or ‘burn-in period’ of any newly implemented system

(Figure 3-4).

Figure 3-4 “Bathtub” model of reliability for electronic components (Leveson, 1995)

Country B underwent a complete modernisation of its ATM system in 2000. Given that

a typical ‘burn-in period’ ranges between 30 and 90 days (IEEE, 1998), it is reasonable to

assume that the system was well integrated and settled for the period of the data (i.e.

2001 to 2005). Therefore, the average number of incidents reported in this period could

be considered representative and appropriate for further analysis.
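
The shape of the ‘bathtub’ model referred to above can be illustrated numerically. The sketch below is not taken from Leveson (1995): it uses a common textbook construction, the sum of a decreasing Weibull hazard (early failures), a constant hazard (useful life), and an increasing Weibull hazard (wear-out), and every parameter value is a hypothetical choice made only to produce the characteristic shape.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    if t <= 0:
        raise ValueError("t must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Illustrative bathtub curve; all parameters are hypothetical."""
    early = weibull_hazard(t, shape=0.5, scale=1000.0)    # burn-in, decreasing
    useful_life = 1e-4                                    # constant random failures
    wear_out = weibull_hazard(t, shape=5.0, scale=5000.0) # ageing, increasing
    return early + useful_life + wear_out


# Hazard is high during burn-in, low in useful life, and rises again at wear-out.
for hours in (10, 1000, 10000):
    print(hours, f"{bathtub_hazard(hours):.6f}")
```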

However, the time period available for Country C consists of 13 consecutive years (i.e.

1992 to 2004). This country went through extensive regulatory changes throughout the

1980s. The change in air service licensing assured that any operator that could prove

financial viability and meet safety standards would obtain a licence. As a result, by the

end of the 1980s, the number of operators had more than doubled. At about the same

time, the Government decided to commercialise most of its service provision activities.

Thus air traffic and other services formed new state-owned commercial enterprises.

However, all of these changes were firmly embedded into the system by the 1990s,

and therefore, the sample provided could be considered stable and appropriate for

further analysis.

Country D is unique in that it provided data from a single engineering unit database and

therefore represents the most detailed data source in this research. It covers the



shortest period available (3.5 years) but contains the highest proportion of failures:

about 75 percent of all reports identified as equipment failures (7,788 of 10,351).

Although the available sample has a significant number of operational failure reports,

this still does not indicate how representative these reports are of the operational ATC

environment. For this reason, a top-down methodology for total aviation system

safety is developed. This methodology enables determination of the contribution of

ATC equipment to the safety of the overall air transport system based on past

research. Once this is established, the same methodology is applied using the

operational failure reports and then the results are compared. This methodology and

the subsequent validation of the available operational data are presented in the

following section.

3.4 Methodology to assess the relevance of supporting data

This section develops the methodology for an assessment of the available sample of

operational failure reports. In order to assure the relevance of this sample, this section

builds a methodology for its validation. In short, the contribution or risk budget of

equipment failures to the overall safety of the air transport system extracted from past

literature is compared to the result obtained from the analysis of available operational

failure reports. The section starts by identifying the overall aviation Target Level of

Safety (TLS) and derives risk budgets for ATM and its ATC component. It concludes by

determining the risk budget of ATC equipment. In other words, this methodology

determines the contribution of ATC equipment failures to the safety of the overall air

transport system. This finding is then compared to the results of the preliminary

analysis of the available operational failure reports.

3.4.1 The accident to incident ratio

Aviation Target Level of Safety (TLS) expressed only in terms of accidents has two

potential limitations. Firstly, the number of accidents is small for any adequate

statistical analysis. Non-accident data, such as loss of standard separation between

aircraft in controlled airspace, is therefore necessary to establish the occurrence of any

trends. Secondly, the number of accidents (or accident rate) is not necessarily the best

measure of safety performance. For example, the currently used target of one accident

in 10^7 flight hours demands the collection of operational data over many years to

demonstrate whether the TLS has been met. A single accident may violate the TLS,

whilst many years without an accident will satisfy the TLS, but conceal any

deterioration in safety prior to an accident (Graham, Kinnersly, and Joyce, 2002). In



this context, past safety analyses (not only in aviation) have used the number of

incidents together with the assumed accident/incident ratio. The United States Federal

Aviation Administration (FAA, 2000) cites several different analytical approaches. The

two most common of these are discussed below.

In the 1940s, Heinrich introduced the idea of the existence of accidents where injuries

did not occur, but considered only damage to property (Heinrich, 1941). This led to the

creation of the so-called ‘Heinrich pyramid’ with established proportions of accidents,

serious incidents, and incidents; 1:29:300 (Saldana et al., 2002). After these initial

studies, there was stagnation in the theoretical underpinnings of safety investigations

until the practical work of Byrd in the 1970s. Byrd carried out his work in a steel factory

and revised Heinrich’s proportions to 1:29:600 (Saldana et al., 2002).
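
The proportions quoted above can be applied mechanically, which helps to show what such a ratio does and does not claim. The sketch below is illustrative only: the function names are hypothetical, and scaling the ratios linearly is an assumption of this example rather than a method used in the thesis.

```python
# accidents : serious incidents : incidents, as quoted in the text
HEINRICH = (1, 29, 300)
BYRD = (1, 29, 600)


def implied_counts(ratio, accidents=1):
    """Scale a pyramid ratio to a given number of accidents."""
    acc, serious, incidents = ratio
    scale = accidents / acc
    return {
        "accidents": accidents,
        "serious incidents": serious * scale,
        "incidents": incidents * scale,
    }


def implied_accidents(ratio, observed_incidents):
    """Accidents implied by an observed number of incidents under the ratio."""
    acc, _, incidents = ratio
    return observed_incidents * acc / incidents


print(implied_counts(HEINRICH, accidents=2))
print(implied_accidents(HEINRICH, 600))  # 2.0
print(implied_accidents(BYRD, 600))      # 1.0
```

The same 600 incidents imply two accidents under one ratio and one under the other, which illustrates how sensitive such estimates are to the ratio assumed.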

However, whilst both of these studies are valuable in their statistical analyses, they do

not seem to be appropriate in dealing with equipment failures in ATC, at least not in the

ratios they offer. Both studies are designed to determine the risk and related ratio of

on-the-job accidents and incidents. The reason for the weaknesses in both studies may

originate from their design and in particular, the bias of analysing accident reports filed

by supervisors only (which tend to blame injuries on workers) and much lower levels of

equipment reliability and integrity compared to the systems used in ATC today.

For the purpose of the research presented in this thesis, additional attention has been

given to the ratio between accidents and incidents induced by ATC equipment failures.

In this respect, a EUROCONTROL safety assessment study assumed that one in 10,000

equipment failures will contribute to an aviation accident (EUROCONTROL, 2004c), an

assumption which is in line with the high reliability requirement for the overall ATC

systems, as well as ATC equipment. A number of arguments can be made to suggest

that in future, this proposed ratio will decrease:

• The number of incidents should decrease due to continuous safety initiatives and

hazard prevention programmes;

• The probability of an incident leading to an accident should decrease due to

increases both in equipment reliability and advanced solutions for redundancy

and diversity (dissimilar redundancy);

• Changes should be seen in the type of incidents occurring, in that as a result of

enhanced risk management approaches, the frequency of serious incidents

should reduce;



• There should also be a decrease in the number of software-related incidents,

which are prevalent today as discussed earlier. Hardware-related incidents

should also diminish.

The arguments discussed above imply a step change in software and hardware

reliability as a result of considerable operational experience, knowledge, and expertise.

For example, in its requirements for the software configuration EUROCONTROL states

that reporting, tracking, and corrective actions are set in place to mitigate any software-

related problem (EUROCONTROL, 2003i). Note also that a decrease in the number of

incidents should only consider the steady state (i.e. useful life) as captured in the ‘bath

tub’ reliability model (Figure 3-4).

It has been highlighted that perception of risk only in terms of accidents tends to mask

the actual safety issues. For this reason, it is important to include the number of

incidents so as to estimate the appropriate accident/incident ratio. After the discussion

of the accident/incident ratio, the following section discusses the units of

measurement used in aviation and thus the different perspectives obtained in the

investigation of a critical event.

3.4.2 Units of measurement

The rate of any critical event represents the number of occurrences (e.g. equipment

failures, incidents, accidents) divided by the exposure to those events. For example,

aviation accident statistics are presented in a variety of ratios and units, called units of

measurement. The most frequently used are the number of accidents per operation

(take off or landing), per million flight hours flown, per flight, per million departures, per

million aircraft-miles, per million aircraft-hours, per million passenger-hours, and per

million passenger-miles.

No single measurement gives a complete picture of the critical event under

investigation. Each of these units gives only one perspective, whilst possibly hiding

others. For example, rates per million passenger-miles are most useful for comparing

air transport and other modes of transport, whilst aircraft departures are suitable for

comparison of accidents between small commuter jets and large commercial jets (e.g.

BA46 and B747, respectively). In addition, for the determination of the required

performance of the landing aids e.g. Instrument Landing System (ILS) or Microwave

Landing System (MLS), the only appropriate measure would be the number of landings


per time period of interest. Any other measure would mask the true performance

values.

In addition to the units of measure, accident rates are determined by the definition of

the critical event as well. These critical events range from accidents, fatal accidents,

hull losses, to the number of fatalities or injuries. An accident, as defined by ICAO

Annex 13 (ICAO, 2001d), involves “an occurrence associated with the operation of an

aircraft, which takes place between the time that any persons board the aircraft with the

intention of flight and that all such persons have disembarked, in which any person

suffers death or serious injury, or in which the aircraft receives substantial damage.”

This definition therefore comprises fatal accidents as well as hull losses. Thus, in

dealing with various accident rates it is crucial to be aware of the precise definition of

both the critical event and the unit of measurement used.

The current rate of aircraft accidents per million flying hours has remained constant

over recent years. If the same accident rate is assumed for the future together with

predicted increases in traffic levels, there will be an increase in the absolute number of

accidents. Using the current accident rate, ICAO has predicted that by the year 2010

there will be an aircraft accident per week, i.e. 52 accidents per year (Hai, 2004). This

is the reason why the US FAA and other aviation authorities have identified the need to

significantly decrease the risk of aircraft accidents.

The following sections propose a methodology for the derivation of aviation target level

of safety (TLS) based on the rate of aircraft accidents (defined as a number of

accidents per flight hour). An accident is defined according to ICAO, while the flight

hour has been chosen as the most appropriate measure of risk induced by equipment

failures. It is usually more convenient to work in terms of flight hours rather than

operational hours of an ATC unit or sector. This approach avoids difficulties and

differences associated with the geographical coverage of the system(s) being

considered, phase of flight, the density and complexity of airspace, as well as available

systems and equipment (e.g. number of radars, navigation systems, communication

systems). This is also in line with Required Communication, Navigation, and

Surveillance Concepts (RNC, RNC, RSC) as defined in the previous Chapter. In short,

the proposed methodology starts by identifying the high-level aviation target level of

safety and then focuses on the precise contribution of equipment failures, the type of

occurrence under investigation in this thesis.


3.4.3 The acceptable risk or target level of safety (TLS)

The methodology to determine the contribution of equipment failures to the safety of

the overall air transport system is organised in several steps. Firstly, existing aviation

standards for Target Level of Safety (TLS) are assessed. Secondly, the contribution of

ATC to the risk of an aircraft accident is determined. Thirdly, the contribution of ATC

equipment to the ATC risk budget is determined. These findings are then extrapolated

to the year 2020, as the target year in this research in line with the European

Commission’s ‘Vision 2020’ (European Commission, 2001). The final step involves

validation of the available sample of operational data using the same methodology.

These steps are presented in the following sections.

3.4.3.1 Existing standards

Technology and engineering have brought numerous inventions and benefits to the

modern way of life. Whilst these benefits are welcome, the risks associated with them

are not. The high pressure on the engineering world to reduce risk and increase safety

comes at a financial price. Therefore, it is important to manage the trade-off between

risk and the cost of its reduction.

As a result, there are certain degrees of risk that must be accepted. Determining the

acceptable level of risk¹ is generally the responsibility of management and is based on

several principles. These are the objective to be achieved, the alternatives available,

and the consequences and values that can be identified. Based upon this, the TLS is a

quantified level of risk (or potential loss) that a system should be designed to deliver

(Brooker, 2004). In aviation, the TLS is usually expressed as a number of aircraft

accidents per flight hour flown, which is used in this thesis, as indicated previously.

The concepts of TLS and risk budgeting are directly linked. Indeed, risk budgeting

represents a top-down distribution of TLS (or total aviation risk) between the

independent sub-categories. The logic behind this process is to specify the maximum

acceptable risk for each sub-category, so that each one has to produce equal or lower

risk than prescribed (see Figures 2-1 and 2-3).

¹ Note the difference between acceptable and tolerable risk. Tolerability refers to a “willingness to live with a risk so as to secure certain benefits and in the confidence that it is being properly controlled”. Tolerable risk is not ignored, but is controlled and reduced further if possible. On the other hand, acceptable risk means that we are “prepared to take risk as it is” (Reid, 1996). It should also be noted that acceptable risk is a relative term and is based on different risk perceptions: individual, public (group of individuals), industry (industry usually needs additional pressure to declare a product as unsafe), and risk perception by safety experts. They all differ in the level of risk they are willing to ‘accept’.

As pointed out by Brooker (2004), there are several methods to derive the TLS. In most

cases, the analysis starts from the current situation and uses an improvement factor to

derive the desired TLS. In some cases, this improvement factor may be established as

a continuing trend from the past translated into the future. It should incorporate traffic

growth factors, factors representing changes in the systems involved, the operational

procedures, and work practices. In other cases, it may be based on a common

agreement between technical experts, with the main idea underlying it being to set

challenging, but still realistic safety improvement targets.

The following sections provide an overview of the most relevant aviation TLS analyses.

The level of diversity between these approaches highlights the complexity of the

problem and the need for a consistent top-down total air transport system approach.

3.4.3.1.1 Joint Aviation Authorities

The Joint Aviation Authorities (JAA) document JAR 25.1309 is one of the main regulatory

documents in aviation. It also defines the fundamental principles that govern aircraft

design and certification. JAR 25.1309 defines the risk of a serious accident due to

“operational and airframe-related causes” to be in the order of one per million hours of

flight. About ten percent of the number of accidents related to operational and airframe

causes is attributed to aircraft equipment failures (e.g. hydraulics and electrical

systems) and the rest (90 percent) to other operational aspects (JAA, 1994). A

EUROCONTROL review of existing TLS standards and practices (EUROCONTROL,

2000a) argues that this requirement is based on data from the 1960s and as such is

outdated. Furthermore, the JAR requirement is related to aircraft design,

encompassing only aircraft equipment, without consideration for the other components

of the air transport system (including ATM). Accordingly, this JAR requirement needs to

be updated to reflect the major changes in the aviation industry since the 1960s. The

following paragraphs indicate several key factors that symbolise the changes and

growth in aviation since the 1960s.

There has been a rapid expansion in the air transport industry over the last four

decades due to a number of factors, including growth in the world economy,

advancement in flight technology and the deregulation of the airline services. The result

of these forces has been a steady decline in airline costs and passenger fares, which


has further stimulated traffic growth. As an example of economic growth, ICAO reports

that total gross domestic product (GDP) has increased by a factor of

3.8 over the same period (ICAO, 1997). The GDP is considered to be the most

appropriate available measure of world output and indicates the health of the global

economy.

Changes in flight technology have also had a major effect on the growth in travel

demand. The modern era of air transportation began in the 1960s. The major drive was

the replacement of piston engines with jet engines, which was accompanied by

increased speed, reliability, and comfort. This change led to a reduction in operational

costs, which in turn led to increased travel demand.

In addition to this, changes in the regulatory environment in both the US and Europe

have had a big effect. The deregulation of airline services in the US in 1978 allowed

airlines to improve services, reduce average costs, increase routes, and increase

efficiency of scheduling. In Europe, the introduction of a single market for aviation

services by the European Union in 1992 has brought changes similar to those seen in the

US.

The ICAO Manual on Air Traffic Forecasting (ICAO, 1985) suggests three methods for

forecasting future civil aviation traffic. These methods are trend projection, econometric

analysis, and market and industry survey. Econometric forecasting is the only method

that takes into account various economic, social, and operational factors affecting air

traffic. The objective here is to translate the relevant factors into projections of future

traffic growth. Then the traffic growth factors are reviewed further to incorporate

prospective changes by other factors that are not accommodated in the econometric

analysis.

The predicted traffic growth will influence target safety levels through the increase in

the number of flight hours forecast. However, there are other factors, not necessarily

included in this forecast of traffic growth, that have the potential to influence the level of

safety. Some of these factors are: the growth in the total number of aircraft flying as

well as in the passenger capacity of aircraft (e.g. Airbus 380, Airbus 350, Boeing 7E7

Dreamliner), increased airport and airspace congestion, technological development

(e.g. advanced safety nets, satellite-based CNS/ATM), and pressure on finding the

tools to control and mitigate human error. Another important factor not considered is


the increasing effect of environmental policies on aviation, in particular on air fares,

costs, and restrictions to possible routes.

Therefore, in line with the EUROCONTROL argument, the JAR requirement should be

informed by an analysis based on an updated data sample of accident rates from the

last four decades. At the same time, future predictions and regulations should be based

on econometric forecasting, which will involve the effect of traffic growth as well as

other economic, technical, and operational factors.

3.4.3.1.2 UK Civil Aviation Authority

The UK Civil Aviation Authority (CAA) has calculated a worldwide fatal accident rate

using the Worldwide Aircraft Accident Summary (WAAS) aviation database sample² for

the period 1990-1999 (UK CAA, 2000). The CAA based its analysis on this sample and

the following assumptions (EUROCONTROL, 2005):

• A fixed annual traffic growth rate until the year 2020 (i.e. 4 percent for western built jets); and

• A constant number of fatal accidents per year (i.e. eight fatal accidents each year).

Based on these assumptions, the UK CAA predicted a rate of 1.8E-07 fatal accidents

per flight for the year 2020. For the purpose of the methodology presented in this

Chapter, this target has been translated into the rate per flight hour using the

information available on the Boeing web site (Boeing, 2004) as follows. The average

flight in 1982 was approximately 1.4 hours, while in 2002 it was 1.94 hours. If this trend

continues, it is determined in this research that the average flight in 2020 will be 2.43

hours. Using this assumption, the UK CAA’s TLS for the year 2020 corresponds to

7.4E-08 fatal accidents per flight hour.
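
The conversion described above can be reproduced with a short calculation. The sketch below assumes a straight-line extrapolation through the two Boeing data points quoted in the text (1.4 hours in 1982 and 1.94 hours in 2002); the text does not state the extrapolation method, so the linear form is an assumption, and the function and variable names are illustrative only.

```python
def extrapolate_linear(x0, y0, x1, y1, x):
    """Straight-line extrapolation through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y1 + slope * (x - x1)


# Average flight durations quoted in the text (hours per flight).
avg_flight_hours_2020 = extrapolate_linear(1982, 1.4, 2002, 1.94, 2020)

# UK CAA prediction for the year 2020 (fatal accidents per flight).
rate_per_flight = 1.8e-07

# An average flight lasts avg_flight_hours_2020 hours, so dividing the
# per-flight rate by that duration gives the rate per flight hour.
rate_per_flight_hour = rate_per_flight / avg_flight_hours_2020

print(f"Average flight duration in 2020: {avg_flight_hours_2020:.2f} h")
print(f"Fatal accidents per flight hour: {rate_per_flight_hour:.1e}")
```

Under the linear assumption the result agrees with the figures given above (2.43 hours and 7.4E-08 fatal accidents per flight hour) to the precision quoted.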

3.4.3.1.3 International Civil Aviation Organisation

There have been several attempts by ICAO to derive aviation target levels of safety.

These originate from a number of different studies and reports, which are presented

below, from the earliest to the most recent.

² Information published by Flight International (monthly publication of Reed Business Information Group). Includes accidents and serious incidents worldwide with the exception of the Commonwealth of Independent States (CIS) before 1990 (former Soviet Union). The data set covered only commercial aircraft or aircraft with maximum takeoff weight above 5.7t.

• ICAO North Atlantic Systems Planning Group (NATSPG) - the ICAO NATSPG initially developed a method using the data on fatal accidents of jet aircraft in the period from 1959 to 1966 (EUROCONTROL, 2000a). Based on available data³, this analysis estimated a fatal accident rate of 2.34E-06. The analysis progressed by assigning a factor of 0.1 for accidents due to collision. The basis for this assumption is not evident or recorded. An improvement factor of between two and five was further applied to justify the use of historical data for future targets (EUROCONTROL, 2000a). This resulted in a TLS ranging from 12E-08 to 4.6E-08 fatal accidents per flight hour due to collision. Finally, the analysis apportioned the value of the TLS to three flight dimensions and thus calculated a TLS for collision due to loss of lateral separation of between 4E-08 and 1.5E-08 fatal accidents per flight hour.

• ICAO Review of the General Concept of Separation Panel (RGCSP) - in 1995, the ICAO RGCSP reviewed several approaches to deriving a TLS for ATM and accepted the one developed by ICAO NATSPG. The RGCSP assumed a total accident rate from all causes of 1E-07 per flight hour for the year 2010. This TLS is based upon the NATSPG analysis extrapolated to the year 2010 (Brooker, 2004). Based on the contributions from the US (TLS ranging between 2E-09 and 7E-09) and the USSR⁴, the RGCSP agreed upon a TLS value to be used for establishing any vertical minimum performance specification. This value is equal to or better than 5E-09 fatal accidents per flight hour arising from collisions due to any cause for the period 2000 to 2010. This value of a TLS is also indicated in the ICAO Annex 11 (ICAO, 2001c);

• ICAO Annex 11 - in the situation where “fatal accidents per flight hour” is considered to be an appropriate metric, ICAO Annex 11 (ICAO, 2001c) proposes a TLS of 5E-09 fatal accidents per flight hour per dimension after the year 2000. Although ICAO Annex 11 does not provide any justification for this TLS, it is assumed that this value is taken from the ICAO RGCSP. For the period prior to the year 2000, ICAO Annex 11 recommends the use of a TLS of 2E-08 fatal accidents per flight hour per dimension; and

• ICAO All-Weather Operations Panel (AWOP) - the objective of the ICAO AWOP was to assess the required navigational performance (RNP) for approach, landing, and departure phases of flight (ICAO, 1994). Based upon historical data⁵, ICAO’s calculation determined the average hull loss to be 1.87E-06 per flight or 1.27E-06 per flight hour. Based on this historical data, ICAO proposed a TLS for hull loss per flight hour to be 1E-07. The rationale for this risk improvement over the historical accident rate is the removal of pilot errors by the use of glass cockpit aircraft and tunnel incident alarm. The glass cockpit is a system of electronic displays presenting all information on an aircraft's situation, position, and progress. The tunnel incident alarm is an alert that is triggered if the aircraft unintentionally leaves the assigned flight path, the “tunnel”, during the approach and landing phases of flight. Additionally, the objective in aviation safety is to reduce the number of accidents despite increasing flight hours. This is essential if public confidence in aviation is to be maintained as the global air transport system expands.

³ Based on 36 fatal accidents and an estimate of 15.5 million flight hours during the period 1959-1966.

⁴ The USSR developed a series of targets for progressive implementation, such as 1E-08 from 1990 to 2000, 5E-09 for 2000-2010, and 2E-09 for 2010 onwards (ICAO, 1995).
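
The chain of factors in the NATSPG derivation, and the AWOP conversion from a per-flight to a per-flight-hour rate, can be retraced numerically. The sketch below uses only figures quoted in this section and its footnotes; the variable and function names are illustrative, and the division by three assumes an equal split of the collision TLS across the three flight dimensions, which is how the apportionment is read here.

```python
# NATSPG figures quoted in the text.
HISTORICAL_RATE = 2.34e-06   # fatal accident rate, 1959-1966
COLLISION_SHARE = 0.1        # factor assigned to accidents due to collision
DIMENSIONS = 3               # assumed equal split across flight dimensions


def natspg_tls(improvement_factor):
    """Return (collision TLS, per-dimension TLS) for one improvement factor."""
    collision_tls = HISTORICAL_RATE * COLLISION_SHARE / improvement_factor
    return collision_tls, collision_tls / DIMENSIONS


# Improvement factors of two and five bound the range given in the text.
natspg_results = {factor: natspg_tls(factor) for factor in (2, 5)}
for factor, (collision_tls, per_dimension) in natspg_results.items():
    print(f"factor {factor}: collision {collision_tls:.2e}, "
          f"per dimension {per_dimension:.2e}")

# AWOP: hull loss rate per flight converted with the average flight
# duration of 1.47 h given in the footnote to the AWOP data set.
awop_per_flight = 1.87e-06
awop_per_flight_hour = awop_per_flight / 1.47
print(f"AWOP hull loss rate per flight hour: {awop_per_flight_hour:.2e}")
```

The computed values match the ranges quoted above once rounded to the precision used in the text.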

3.4.3.1.4 Summary of the various TLS analyses

The previous section has given an overview of the research on aviation TLS which is

summarised in Table 3-2 (based on the information available). This table enables

comparison of the TLS taking into account the source of data, the time period covered

by the data set, the type of accident, the type of aircraft operation, and the TLS unit

used.

Once again the differences in the derivation of TLS should be pointed out. The

summary presented shows the level of discrepancy in the method, data set, and

taxonomies used. The major factors that drive the differences in the calculation of

target levels of safety are:

• Type of accident (accident, fatal accident, hull loss),

• Weight of aircraft involved in the accident,

• Differences in the definitions (i.e. taxonomies used),

• Type of operations analysed: scheduled vs. non-scheduled, commercial vs. non-commercial (military, freight, general aviation), registered vs. non-registered, domestic vs. international,

• Type of aircraft included: jets vs. turbo props,

• Time frame of the data set analysed,

• Source of the data,

• Region involved in the analysis (with or without former Soviet Union),

• Targeted year for the TLS calculation: current vs. future levels.

⁵ Data set covers hull loss accidents for the period from 1959 to 1990 for commercial jet aircraft whose weight exceeds 60,000lbs. Exposure percentages are based on an average flight duration of 1.47h. A hull loss accident is defined as an accident where the primary cause is hull loss or aircraft damage beyond economical repair.

Table 3-2 Summary of various analyses on aviation TLS

| Reference | Title | Database | Scope: region/time period | Scope: type of operation/weight/type of accident | Target year | TLS |
|---|---|---|---|---|---|---|
| Joint Aviation Authorities | JAR 25.1309 Large Aeroplanes - Advisory Material - AMJ | Not specified | Worldwide, 1960s | Serious accident | Not specified | 1E-06 per flight hour |
| UK Civil Aviation Authority | Aviation Safety Review CAP 701 | WAAS | Worldwide, 1990-1999 | Jets & turbo props/MTOW > 5,700 kg/fatal accidents | 2020 | 1.8E-07 per flight/7.4E-08 per flight hour |
| ICAO | North Atlantic Systems Planning Group (NATSPG) | Not specified | Worldwide, 1959-1966 | Jets | Not specified | 2.34E-06 per flight |
| ICAO | Review of the General Concept of Separation Panel (RGCSP) | Not specified | Not specified | Jets/fatal accidents | 2010 | 1E-07 per flight hour |
| ICAO | Annex 11 | Not specified | Worldwide | En route fatal accidents | After the year 2000 | 5E-09 per flight hour per dimension (1.5E-08 per flight hour) |
| ICAO | All-Weather Operations Panel (AWOP) 15th meeting | Not specified | Worldwide, 1959-1990 | Jets/MTOW > 60,000 lb/hull loss accidents | Not specified | 1E-07 per flight hour |

Key: MTOW = maximum take-off weight of the aircraft

After the review of the most relevant analyses and methods of TLS calculation, the TLS

of 1E-08 accidents per flight hour is used as the baseline for the year 2020 (target year

of the research presented in this thesis). The reasons for using this baseline are:

• The rate of 1E-07 is currently used as a target by ICAO for both fatal accidents and hull loss accidents (see Table 3-2);

• With the overall aim of reducing the accident rate given the current safety targets, it is reasonable to aim at 1E-08 accidents per flight hour in the year 2020;

• The analysis conducted by the UK CAA to predict the rate of fatal accidents for 2020 (i.e. 7.4E-08 fatal accidents per flight hour).


Once the TLS for the year 2020 is determined, the next step is to apportion the

contribution of ATC in the overall air transport TLS. To establish this, several studies

have been reviewed. The key findings are presented in the following section.

3.4.4 Target level of safety and Air Traffic Control risk budgeting

The next step is to determine the risk budget allocation for the ATC system as a

component of the overall air transport system, i.e. determine the contribution of ATC.

According to the results of the UK CAA’s analysis, the contribution of ATC and ground

aids to aircraft accidents is 1.7 percent (Table 13 in EUROCONTROL, 2005).

EUROCONTROL currently uses 2 percent as a maximum direct contribution of ATM to

aircraft accidents within the European Civil Aviation Conference (ECAC) region. This

figure was derived based upon historical data (ICAO ADREP database focused on the

ECAC region) from which a contribution of ATC is determined to be 1.1 percent

(EUROCONTROL, 2001a). Recognising that only ATC causes were accounted for

(without contribution of other ATM components, such as ATS, ASM, ATFM),

EUROCONTROL allowed an additional 0.9 percent, resulting in a 2 percent ATM

contribution to aircraft accidents. This figure has been further validated via discussions

with EUROCONTROL Safety Regulatory Commission’s task force Hazard

Classification Matrix (HCM). EUROCONTROL has defined “the maximum tolerable

probability of ATM directly contributing to an accident of a commercial air transport

aircraft” in the ECAC region to be 1.55E-08 per flight hour (EUROCONTROL, 2001b).

This figure is based on the rate of aircraft accident for the year 1999 (extracted from

ICAO ADREP database focusing on the ECAC region) with direct ATM contribution (2

percent) and a forecast of 6.7 percent increase in the traffic volumes for the period

1999-2015 (EUROCONTROL, 2001a).

In the Netherlands, a study by the national research laboratory (NLR) used a sample of

civil aircraft accidents that occurred worldwide during the period 1980-1999, mostly

based on ICAO database (van Es, 2003). This study determined that ATM-related

accidents represent 8 percent of the total number of accidents. Additionally, 28 percent

of these ATM-related accidents are directly caused by ATC, which makes the ATC

contribution to aircraft accidents approximately 2.2 percent. The difference in the

contribution of ATC in these two studies is due to the difference in classification of

causal factors. While the UK CAA analysis divided all underlying factors into primary,

causal, and circumstantial groups, the NLR analysis followed the recommendation by


ICAO and did not use this distinction. The NLR study considered an occurrence as a

causal factor only if that occurrence was part of the chain of events leading to the

accident. The NLR approach seems to reflect better the aim of determining the overall

ATC contribution to aircraft accidents.

The results presented above need to be augmented for possible statistical error and

uncertainties linked to the reporting processes as well as to provide additional

protection for the future. As previously discussed, EUROCONTROL allowed an additional

0.9 percent for statistical error and uncertainties in the calculation of the ATM safety

targets for ECAC region based upon historical data for only one component of ATM,

namely ATC (EUROCONTROL, 2001a). With this in mind, together with the results

from UK CAA and NLR studies, this thesis uses a maximum contribution of ATC of 3

percent. Thus, using the previously established TLS for the air transport system for the

year 2020 (in the previous section), the apportioned contribution of ATC is considered to

be 3E-10 per flight hour. Now, after deriving the TLS for ATC specifically, this functional

block should be divided between human operators, equipment, and procedures. This

approach now gives the opportunity to define the appropriate risk induced by failure of

ATC equipment which is presented in the next section.
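
The apportionment in this section reduces to a few multiplications, sketched below with the percentages quoted above. The variable names are illustrative and not part of any cited study; the 3 percent share is the value adopted in this thesis, not one computed from the other figures.

```python
# NLR study (van Es, 2003): share of accidents that are ATM-related, and
# share of those ATM-related accidents directly caused by ATC.
atm_share_of_accidents = 0.08
atc_share_of_atm = 0.28
nlr_atc_share = atm_share_of_accidents * atc_share_of_atm

# EUROCONTROL: 1.1 percent derived for ATC plus a 0.9 percent allowance
# for the other ATM components.
eurocontrol_atm_share = 0.011 + 0.009

# Values adopted in this thesis.
tls_2020 = 1e-08        # accidents per flight hour, overall air transport
atc_max_share = 0.03    # maximum contribution of ATC
atc_budget = tls_2020 * atc_max_share

print(f"NLR ATC contribution: {nlr_atc_share:.1%}")
print(f"EUROCONTROL ATM contribution: {eurocontrol_atm_share:.1%}")
print(f"ATC risk budget for 2020: {atc_budget:.0e} per flight hour")
```

The adopted 3 percent sits above the UK CAA (1.7 percent), EUROCONTROL (2 percent), and NLR (about 2.2 percent) figures, which reflects the margin for statistical error and reporting uncertainty discussed above.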

3.4.5 Target level of safety and Air Traffic Control equipment risk budgeting

It is important to determine the contribution of equipment (or their failure or malfunction)

to the ATC risk budget. The historical data on the proportion of incidents in which

equipment failure is implicated varies to a certain degree. Interviews with system

control and monitoring staff at two European ATC Centres⁶, as well as the

approximation used by the CORA 2 documentation (EUROCONTROL, 2004c), reveal

that equipment failures are the causal factor in 0.01 (i.e. one percent) of all incidents.

Although this assumption is based on the ATM system and not its ATC component

only, it is used with other sources of information to inform the ATC equipment risk

budgeting within overall air transport system.

A more focused approach is provided by the NLR study (van Es, 2003). This study determined that the particular causal factor ‘ATC ground aid malfunction or unavailable’ has been attributed to 5 percent of all ATM-related accidents or 18 percent of all ATC-related accidents. It should be noted that this causal factor includes ‘unavailable’ ATC equipment, meaning equipment that was taken out of service by ATC staff, presumably for maintenance reasons. In addition, the research was based on data samples that incorporated older systems with lower levels of automation. Future systems are shifting towards a higher level of automation and higher reliability, as discussed in the previous Chapter.

⁶ Based upon private communications with staff at two European Area Control Centres (ACCs).

Therefore, it can be approximated that equipment failures represent the causal factor in 10 percent of all ATC-related accidents (or 3 percent of all ATM-related accidents). This is based on the assumption that unscheduled failures constitute about 50 percent of the failures in the NLR analysis discussed above. This approach derives a risk of an ATC equipment failure leading to an aircraft accident of 3E-11 per flight hour. The reasoning presented seems to correlate with the widespread argument that human error represents the causal factor in 70-80 percent of all accidents (Reason, 1997), although there is some evidence that the majority of these human errors represent organisational errors (Johnson and Holloway, 2004). A graphical representation of the determined risk budgets is given in Figure 3-5.
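
The equipment share can be retraced from the NLR percentages as follows. This is a sketch of one reading of the approximation: halving the NLR figure gives roughly 9 percent, which is taken here to be the basis of the rounded 10 percent used in the text. The variable names are illustrative.

```python
# NLR study (van Es, 2003).
atc_share_of_atm = 0.28          # ATC share of ATM-related accidents
ground_aid_share_of_atm = 0.05   # 'ATC ground aid malfunction or unavailable'

# The same causal factor expressed as a share of ATC-related accidents.
ground_aid_share_of_atc = ground_aid_share_of_atm / atc_share_of_atm

# Assumption stated in the text: about half of these are unscheduled failures.
unscheduled_fraction = 0.5
failure_share_of_atc = ground_aid_share_of_atc * unscheduled_fraction

# Rounded share used in the text, applied to the ATC risk budget for 2020.
adopted_share = 0.10
atc_budget = 3e-10               # per flight hour, from section 3.4.4
equipment_budget = atc_budget * adopted_share

print(f"Ground aid factor as share of ATC accidents: {ground_aid_share_of_atc:.0%}")
print(f"Unscheduled failures as share of ATC accidents: {failure_share_of_atc:.0%}")
print(f"ATC equipment risk budget: {equipment_budget:.0e} per flight hour")
```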

Figure 3-5 Aviation TLS and risk budgeting

After assessing the contribution of ATC equipment failures to the overall risk of aircraft

accident, it is important to validate these findings with some operational experience.

This is achieved in the following section by analysis of operational failure reports from

three countries.


3.5 Preliminary analysis and validation of operational failure reports

The previous sections described the process of deriving an overall aviation TLS for the

reference year 2020 and further risk budgeting for ATC equipment. In order to justify

the use of the available sample of operational reports in this thesis, this sample is

validated by the proposed TLS methodology. This is presented in the following

paragraphs.

Having the accident rate for the year 2000 (EUROCONTROL, 2005) and predicted

accident rates for the year 2010 (1E-07; Brooker, 2004) and 2020 (1E-08, used in this

research), it is apparent that future safety levels are predicted to improve tenfold every

decade. This is in line with the attempts of various aviation institutions to significantly

improve future aviation safety levels (e.g. FAA, ICAO). The next step is to apply

the established rate of improvement to ATC equipment failures.

Using the same analogy and the ratios within an air transport system, as presented in

Figure 3-5, it is possible to translate the 2020 rate of ATC equipment contribution to

aircraft accident to the present levels (i.e. 2000). The calculation presented in section

3.4.5 showed that for the year 2020 this effect is of the order of 3E-11 per flight hour.

Using the reverse logic, this effect corresponds to a level of 3E-09 for the year 2000. In

other words, based on the past research and established ratios the contribution of

equipment failures to the overall safety of air transport system in the current period is in

the order of 3E-09 per flight hour.
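
The back-translation from 2020 to 2000 follows directly from the tenfold-per-decade improvement noted above. The helper below is a minimal sketch of that scaling; the function name and signature are illustrative only.

```python
def scale_to_year(rate, from_year, to_year, factor_per_decade=10.0):
    """Scale a rate, assuming it improves by factor_per_decade every ten years.

    Moving back in time (to_year < from_year) increases the rate.
    """
    decades = (from_year - to_year) / 10
    return rate * factor_per_decade ** decades


# Overall accident rates per flight hour implied by the 2020 baseline.
tls_2010 = scale_to_year(1e-08, 2020, 2010)
tls_2000 = scale_to_year(1e-08, 2020, 2000)

# ATC equipment contribution translated from 2020 back to 2000.
equipment_2000 = scale_to_year(3e-11, 2020, 2000)

print(f"Overall rate 2010: {tls_2010:.0e}, 2000: {tls_2000:.0e}")
print(f"ATC equipment contribution in 2000: {equipment_2000:.0e} per flight hour")
```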

Having established the contribution of equipment failures to the overall safety of the air

transport system based on past research, it is necessary to calculate the same value

using the available operational failure reports. The conformance of ATC equipment

budgeting obtained from past research and available failure reports would indicate that

the available sample is representative of equipment failures occurring in the operational

ATC environment.

Firstly, it is important to discuss the overall commercial air transport accident rates for

the three countries analysed. These rates are higher than the worldwide

average (1E-06 per flight hour; see Figure 3-5), ranging from 9E-06 to 1E-05 aircraft

accidents per flight hour. Secondly, it is necessary to discuss the available sample of

operational failure reports by focusing on the frequency of equipment failure reports per


year and per source. The incident reports used in this section were from three sources,

namely three Civil Aviation Authorities (CAAs), presented as Country A (for the period

1999 to 2003), Country B (for the period 2001 to 2005), and Country C (for the period

1992 to 2004). The final results of this preliminary analysis of available operational

reports are presented in Table 3-3. The average number of failures is calculated for all

three data sets (column 4). This is followed by the calculation of incident rates based

on the average flight hours flown for the given time periods (column 5). The final step

involved adjusting the calculated incident rate to give the probability of an accident

caused by equipment failure (using an accident-to-incident ratio of 1 in 10,000), as

shown in the last column of Table 3-3. In other words, this calculation produced the

operational level of safety for the three countries and their respective time periods.

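Table 3-3 Analysis of operational failure reports and results

| Country (1) | Year (2) | Total number of equipment failures reported (3) | Average number of equipment failures per year (4) | Rate of failure - incident (per flight hour) (5) | Rate of failure - accident (per flight hour) (6) |
|---|---|---|---|---|---|
| A | 1999 | 100 | 158.2 | 1.15E-04 | 1.15E-08 |
|   | 2000 | 107 | | | |
|   | 2001 | 122 | | | |
|   | 2002 | 287 | | | |
|   | 2003 | 175 | | | |
| B | 2001 | 184 | 264.8 | 2.58E-04 | 2.58E-08 |
|   | 2002 | 237 | | | |
|   | 2003 | 171 | | | |
|   | 2004 | 247 | | | |
|   | 2005 | 485 | | | |
| C | 1992 | 28 | 34.46 | 8.85E-05 | 8.85E-09 |
|   | 1993 | 38 | | | |
|   | 1994 | 41 | | | |
|   | 1995 | 21 | | | |
|   | 1996 | 16 | | | |
|   | 1997 | 42 | | | |
|   | 1998 | 40 | | | |
|   | 1999 | 25 | | | |
|   | 2000 | 38 | | | |
|   | 2001 | 27 | | | |
|   | 2002 | 46 | | | |
|   | 2003 | 42 | | | |
|   | 2004 | 44 | | | |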
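
The arithmetic behind columns 4 and 6 of Table 3-3 can be reproduced with a short script. The following is a minimal sketch rather than the procedure actually used in this research: the yearly counts and incident rates are copied from the table, the 1 in 10,000 accident-to-incident ratio and the 3E-09 literature-based value are taken from the text, and all function and variable names are hypothetical. The flight hours behind column 5 are not listed in the table, so the incident rates are taken as given.

```python
# Yearly counts of reported equipment failures (column 3 of Table 3-3).
REPORTED_FAILURES = {
    "A": [100, 107, 122, 287, 175],
    "B": [184, 237, 171, 247, 485],
    "C": [28, 38, 41, 21, 16, 42, 40, 25, 38, 27, 46, 42, 44],
}

# Incident rates per flight hour (column 5 of Table 3-3), taken as given
# because the underlying flight hours are not listed in the table.
INCIDENT_RATE = {"A": 1.15e-04, "B": 2.58e-04, "C": 8.85e-05}

ACCIDENT_TO_INCIDENT_RATIO = 1 / 10_000  # one accident per 10,000 incidents
LITERATURE_BUDGET = 3e-09  # per flight hour, value derived from past research


def average_failures_per_year(counts):
    """Column 4: mean number of reported failures per year."""
    return sum(counts) / len(counts)


def accident_rate(incident_rate):
    """Column 6: incident rate scaled by the accident-to-incident ratio."""
    return incident_rate * ACCIDENT_TO_INCIDENT_RATIO


def ratio_to_budget(country):
    """How many times the operational accident rate exceeds the 3E-09 value."""
    return accident_rate(INCIDENT_RATE[country]) / LITERATURE_BUDGET


if __name__ == "__main__":
    for country, counts in REPORTED_FAILURES.items():
        print(
            country,
            round(average_failures_per_year(counts), 2),
            f"{accident_rate(INCIDENT_RATE[country]):.2e}",
            round(ratio_to_budget(country), 1),
        )
```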

Based on the contribution of equipment failures to the overall safety of air transport

system extracted from the past research and overall TLS methodology (3E-09 per flight


hour), we can conclude that the TLS levels acquired from operational reports (last

column in Table 3-3) show a degree of conformity: all three values lie within one order of magnitude of this figure.

An even higher level of conformity would be achieved by setting a higher TLS

for the year 2000 (the data indicate 1E-05, as opposed to the 1E-06 accepted within the aviation

community). Furthermore, better tuning of the current and future trade-offs within the

air transport system (see Chapter 2, Figures 2-1 and 2-3) would additionally enhance

the proposed methodology for determination of risk budgeting of the ATC equipment.

Future advancements in technology, changes in the levels of traffic, and overall

changes in the ATC/ATM philosophy (e.g. shifting of separation responsibility from the

ground to the air) have the potential to improve safety. At the same time, it is reasonable

to assume that the distribution of the levels of risk within the air transport system will

change. The results specific to ATC given here could be used as an input to a

complete safety analysis that should consider trade-offs between the various

components of the aviation system to realise risk budgets for a safe and cost effective

system. Finally, the severity of the reported incidents could be used to inform the

weighting scheme and to better reflect the accident to incident ratio, as the above

analysis considered all incidents equally.

In short, the above analysis indicates that the available operational failure reports are a

representative sample of equipment failures occurring in ATC Centres worldwide.

Having established the appropriateness of this sample, the following Chapter moves

toward the identification of operational characteristics of equipment failures extracted

from past research and operational failure reports.

3.6 Summary

This Chapter starts with a precise definition of equipment failures and hazards,

the latter representing a sub-group of equipment failures that require human intervention (or

human recovery). It continues by presenting a sample of operational failure reports

available in this research. After a discussion of the reporting schemes designed to

capture incident occurrences, including equipment failures, the Chapter continues by

highlighting data pre-processing problems and solutions applied to overcome them. In

order to assure the relevance of equipment failures captured in the sample available,

the remainder of the Chapter builds a framework for its validation. This framework for

risk assessment, based entirely on past literature, begins from the risk assessment of

the overall air transport system and focuses on one component, namely ATC


equipment. In other words, this section determines the maximum allowed accident risk

imposed by ATC equipment failures for the target year 2020.

The contribution of equipment failures to the overall safety of the air transport system

extracted from past literature has then been compared with the result obtained from

the analysis of the available sample. This analysis showed a degree of agreement between

the theoretically assumed and operationally extracted levels of ATC equipment risk

budgeting. In other words, the available operational failure reports are a representative

sample of equipment failures occurring in the operational ATC environment. Hence, the

next Chapter proceeds with a detailed assessment of the equipment failure

characteristics extracted from operational failure reports and available literature.

Chapter 4 Equipment Failures in ATC

4 Equipment Failures and Technical Defences in Air Traffic Control

The previous Chapter showed that operational failure reports available in this thesis

constitute a representative sample of equipment failures occurring in the operational Air

Traffic Control (ATC) environment. This Chapter moves toward the identification of the

operational characteristics of equipment failures. These are extracted from past

research and more than 20,000 operational failure reports. Special attention is paid to

the impact that equipment failures may have on ATC operations, and as a result a

severity rating scheme has been designed to support the research presented in this

thesis. Having discussed the consequences of equipment failures and their impact on

ATC operations, it is important to discuss how such consequences can be prevented or

mitigated. This involves the process of recovery from equipment failure and a

distinction can be made between technical and human recovery. This Chapter

discusses technical recovery by reviewing the existing technical built-in defences,

whilst the next Chapter discusses human (i.e. controller) recovery. A subset of

equipment failure characteristics relevant to ATC operations is then used in this

Chapter to develop a novel tool for the assessment of the severity of equipment

failures, known as the qualitative equipment failure impact assessment tool. This tool

enables an assessment of the overall impact of an equipment failure on ATC

operations.

4.1 Equipment failure characteristics

When dealing with any type of equipment failure, it is important to understand its

underlying characteristics. In other words, it is important to take into account issues like

causes, consequences, duration, and complexity. Thus, a detailed hazard analysis

would capture the most important characteristics of a failure and the context

surrounding its occurrence (Leveson, 1995). The following sections explain several

important failure characteristics:

• ATC functionality affected;

• Complexity of failure type;

• Time course of failure development;

• Duration of failure;

• Potential causes of equipment failure; and

• Consequences of equipment failure.

The consequences of equipment failures are discussed on several different levels,

ranging from their impact on the individual (i.e. the air traffic controller), the operations

room, and the ATC system to their impact on the overall ATM system.

4.1.1 ATC functionality affected

The methodology adopted in this thesis for the classification of ATC functionalities

results in a nine-category classification (Chapter 2, section 2.3). Several examples of

the equipment failures related to different ATC functionalities are presented in Table 4-1.

These examples are randomly selected and de-identified from operational failure

reports available in this research, as discussed previously in Chapter 3.

Table 4-1 Examples of equipment failures related to different ATC system functionalities (as defined in Chapter 2)

| Type of failure | Example |
|---|---|
| Communication function | Total radio telephony failure on three frequencies (three sectors). Workstation had to be reset to default fallback setting. |
| Navigation function | Runway 15 Instrument Landing System (ILS) failed whilst aircraft on 16 NM final approach in Instrument Meteorological Conditions (IMC). Approach Control Centre was advised and aircraft confirmed the failure. Aircraft was preparing for a missed approach, when the ILS returned to service after recovery. |
| Surveillance function | Erroneous altitude readings displayed on radar for B777 and B767 at FL340 and FL350, respectively. Short term conflict alert (STCA) was activated. |
| Data processing function | Triple failure on suite flight data exchange. System fully recovered after 40 min by manual intervention. Departures from two airports were stopped for approximately 10 min. The cause was the existence of duplicate flight identity numbers within the flight data held in the affected workstations. |
| Supporting function | B737 was on the final approach at 50ft over the runway when the controller received a false Approach Monitoring Aid (AMA) warning. The controller was concerned that in low visibility conditions a go-around would have been unnecessarily given. |
| Safety nets (SNET) | STCA failed to activate against two aircraft at FL120. One aircraft was dropping parachutes, with the other filming them. Consequently, the aircraft were quite close to each other. They were both squawking Secondary Surveillance Radar (SSR) codes, but Short term Conflict Alert (STCA) failed to activate. |
| Power supply | At time 0535 power failure in the tower caused Radar Data Processing System (RDPS) and Flight Data Processing System (FDPS), radar, public telephone network, weather radar, and computer failure. At time 0650 position rebooted and upgraded. ATC service returned to normal at 0730. |
| Pointing and input devices | Cursor frozen in global ops field of electronic flight strip. The controller was moved to an adjacent console and resumed operations from that position. There was only a brief interruption to the service. |
| System monitoring and control function | At 0215 the ATC system suffered a significant slowdown. The System Monitoring (SMS) shut itself down. |

4.1.2 Complexity of failure type

Failures can be single or multiple component failures (Wickens et al., 1998). A single

failure can be total or partial, affecting only one piece of equipment or one of its

components. Multiple component failures can be independent of each other (which can

make the process of diagnosis very difficult) or dependent failures (common cause,

common mode, or cascade failures) (Mauri, 2000). Common cause failures occur when

a single cause creates simultaneous (or near simultaneous) multiple failures (e.g. due

to fire, loss of power, or software bug). Common mode failures are a subset of common

cause failures whose observed effect on the system is identical. Cascade failures are

dependent failures that affect redundant components by shifting their load sequentially

(e.g. power grids or servers). Once the first level of redundancy is pushed beyond its

capacity (e.g. transformer), the load will be shifted onto the next redundant component

until all redundancies are exhausted (Mauri, 2000).
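
The load-shifting mechanism behind a cascade failure can be illustrated with a short sketch. It is a deliberately simplified model, assuming that each redundant component must carry the whole load on its own and fails outright when that load exceeds its capacity. The function name, the capacities, and the load units are hypothetical and are not drawn from Mauri (2000) or from the operational failure reports.

```python
def simulate_cascade(capacities, load):
    """Shift `load` through redundant components in order of activation.

    capacities: capacity of each redundant component, first in line first.
    load:       load that a single component must carry on its own
                (ASSUMPTION: no load sharing between components).

    Returns (failed, carrier): the indices of the components that were
    pushed beyond capacity, and the index of the component that finally
    carries the load, or None if all redundancies are exhausted.
    """
    failed = []
    for index, capacity in enumerate(capacities):
        if load <= capacity:
            return failed, index  # this level of redundancy holds
        failed.append(index)      # overloaded: shift load to the next one
    return failed, None           # every redundant component has failed


if __name__ == "__main__":
    # Hypothetical capacities for three redundant components.
    print(simulate_cascade([100, 80, 120], load=110))
    print(simulate_cascade([100, 80], load=110))
```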

4.1.3 Time course of failure development

In terms of time course of failure development, there are sudden, gradual, or latent

failures. With sudden failures, the operator does not have much time to prepare for

recovery, but at the same time there is the potential advantage of immediate detection

of the failure. Contrary to this, gradual failures may degrade system capabilities in ways

that are not apparent to the operator (e.g. gradual loss of data integrity). This makes

failure detection, and therefore technical and human recovery, extremely difficult. Latent

failures are generally difficult to detect. They exist in the system unnoticed

until some other failure or unusual occurrence reveals them

(Wickens et al., 1998). As a result, this group of failures is

treated separately, as the time course of their initial development is not known, i.e.

these failures could initially occur as either sudden or gradual failures.


4.1.4 Duration of failure

Duration of failure is defined as the time from the first log of the event (which corresponds

closely to failure detection) until its final closure. Applied to a specific failure, it can

carry important information on recovery and its impact on ATC, ATM, and overall

aviation safety. The categories defined in this research are based on the evidence from

the available operational failure reports. Their analysis indicates the distribution of

failure duration which corresponds to the following categories (section 4.4.6):

• Short period of time - order of magnitude is in minutes;

• Moderate period of time - order of magnitude is in minutes up to one hour; and

• Substantial period of time - order of magnitude is in hours (it can extend to days).
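
The definition of failure duration and the three categories above can be sketched as follows. This is an illustration only: the categories are defined by order of magnitude, so the 10-minute boundary between 'short' and 'moderate' is an assumption introduced here, as are the function names. Only the one-hour boundary follows directly from the wording of the categories.

```python
from datetime import datetime

SHORT_LIMIT_MIN = 10.0     # ASSUMPTION: hypothetical boundary, not from the source
MODERATE_LIMIT_MIN = 60.0  # "minutes up to one hour"


def failure_duration_minutes(first_log, final_closure):
    """Time from the first log of the event until its final closure."""
    if final_closure < first_log:
        raise ValueError("final closure precedes the first log of the event")
    return (final_closure - first_log).total_seconds() / 60.0


def duration_category(minutes):
    """Assign a duration in minutes to one of the three categories."""
    if minutes < 0:
        raise ValueError("duration cannot be negative")
    if minutes <= SHORT_LIMIT_MIN:
        return "short"
    if minutes <= MODERATE_LIMIT_MIN:
        return "moderate"
    return "substantial"


if __name__ == "__main__":
    # Power supply example from Table 4-1 (0535 to 0730); the calendar date
    # is arbitrary and the choice of start and end events is an assumption.
    start = datetime(2000, 1, 1, 5, 35)
    end = datetime(2000, 1, 1, 7, 30)
    minutes = failure_duration_minutes(start, end)
    print(minutes, duration_category(minutes))
```

The usage example treats the 0535 power failure and the 0730 return to normal service in Table 4-1 as the first log and the final closure, which is itself an assumption about how that report was logged.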

4.1.5 Potential causes of equipment failures

The causes of equipment failures come from three interacting sources. These are:

• Technical faults as defects or anomalies built into the system or its components;

• Human errors or violations as acts of omission or commission by the designer,

constructor, controller, engineer, or maintenance personnel that might result in a

failure; and

• External factors or unfortunate, unforeseen, or uncontrolled events, such as severe

weather, fire, accidents, vandalism, sabotage, or terrorism.

The listed causes of failures represent only the first layer of causation. Further analysis

might reveal the existence of organisational error, organisational loss of control, or

failure to anticipate all hazardous conditions and prepare appropriate defences against

them. As an example, the impact of a power outage should be anticipated by

management and consequently appropriate preventive strategies should be

implemented. Similarly, the threat of either terrorism or vandalism should be guarded

against through the provision of adequate internal security measures.

There are various techniques designed to investigate technical faults, human error, and

organisational error. For technical faults, Fault Trees (FT), Event Trees (ET), and

Probabilistic Safety Assessment (PSA) are mostly applied (Brooker, 2006); human

error is investigated by a range of Human Reliability Assessment (HRA) techniques

which are discussed in more detail in Chapters 7 and 8. Finally, organisational errors

are mostly investigated using the Reason model (Reason, 1997), the Human Factors


Analysis and Classification System-HFACS (Shappell, 2000), or qualitative principles

behind a safety culture (Sorensen, 2002).

After a brief discussion of these five failure characteristics, the next section discusses the

potential consequences of equipment failures. The consequences of equipment failures

are discussed at several levels, from their impact on the individual (i.e. the controller),

the operations room, the ATC system, concluding with their impact on the ATM system

as a whole.

4.2 Consequences of equipment failure

Equipment failures that penetrate existing technical built-in defences and hence affect

controller performance (called hazards) are the main focus of the research

presented in this thesis. Therefore, the consequences of these failures are initially

assessed at the level of the controller, followed by the operations room, a given

airspace (i.e. the impact on ATC operations), and finally at regional level (i.e. the

impact on ATM operations).

4.2.1 Impact on air traffic controller

The impact of equipment failures on controller performance represents the focus of this

thesis, and as such will be assessed in detail in the following Chapters. One equipment

failure occurrence in the Lisbon ATC Centre highlights the impact that equipment

failures could have on the controller (Sampaio and Guerra, 2004). In this very busy

sector, a sudden failure of the Radar Data Processing System (RDPS) affected only

one radar track. This failure went unnoticed for 21 minutes until a traffic advisory by the

cockpit-based Traffic Alert and Collision Avoidance System (TCAS) triggered an action by

the controller. The controller did suspect a problem prior to the TCAS alert, but

focused only on human error in the input of relevant data (i.e. SSR code).

Unfortunately, the controller never considered the possibility of an equipment failure.

Post-incident investigation revealed that the cause of this failure was incompatibility of

the software developed for the installed radar with the software of the main ATC

system. However, the same investigation did not reveal why this failure affected only

one radar track and not all tracks supplied by the same radar. This particular example

highlights how complex and severe an equipment failure can be.

4.2.2 Impact on operations room

The impact of equipment failures on the entire ATC operations room depends entirely

upon the failure characteristics in terms of the number of equipment/positions affected.


Another important factor is the overall ATC Centre architecture, since exposure to

failure varies greatly based on the interconnectivity of different equipment, the level of

separate channels (redundancy/variability), and failure complexity (single failure vs.

multiple failures). Based on operational experience (NATS, 2002) and ATC operations

room configuration, four categories can be differentiated. These categories range from

an impact on the entire operations room, through several sectors, to only one sector. The

categories are defined as follows:

• All workstations/all sectors affected;

• A number of workstations/different sectors affected;

• Several workstations (within same suite)/one sector affected; and

• One workstation/one sector affected.

The proposed categorisation by NATS follows the severity of the impact of failures on

the operations room starting with the most severe failure (known as outage) to the least

severe type of failure (affecting only one workstation). In addition, each ‘suite’ is

responsible for a specific portion of airspace (i.e. sector) whilst each sector has a

declared capacity (expressed in terms of the number of aircraft in the sector in the peak

hour). As a result, the failure characteristic ‘impact on operations room’ is linked with

the number of aircraft exposed to the impact of equipment failure.
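
The link between the four categories and the number of aircraft exposed can be sketched as follows. The sketch assumes that exposure is bounded by the sum of the declared capacities of the affected sectors, and that a failure touching every sector counts as 'all sectors affected' regardless of how many workstations are involved. The sector names, the capacities, and all identifiers are hypothetical.

```python
from enum import Enum


class OpsRoomImpact(Enum):
    ALL_SECTORS = "All workstations/all sectors affected"
    DIFFERENT_SECTORS = "A number of workstations/different sectors affected"
    ONE_SUITE = "Several workstations (within same suite)/one sector affected"
    ONE_WORKSTATION = "One workstation/one sector affected"


def classify_impact(affected, total_sectors):
    """Classify a failure by the sectors and workstations it affects.

    affected:      dict mapping sector name -> number of workstations affected
    total_sectors: number of sectors in the operations room
    """
    if not affected:
        raise ValueError("no sector is affected")
    if len(affected) == total_sectors and total_sectors > 1:
        return OpsRoomImpact.ALL_SECTORS
    if len(affected) > 1:
        return OpsRoomImpact.DIFFERENT_SECTORS
    workstations = next(iter(affected.values()))
    if workstations > 1:
        return OpsRoomImpact.ONE_SUITE
    return OpsRoomImpact.ONE_WORKSTATION


def exposed_aircraft(affected_sectors, declared_capacity):
    """Upper bound on exposure: sum of declared peak-hour sector capacities."""
    return sum(declared_capacity[sector] for sector in affected_sectors)


if __name__ == "__main__":
    # Hypothetical declared capacities (aircraft in the sector in the peak hour).
    capacity = {"North": 30, "South": 25, "East": 20}
    affected = {"North": 1, "South": 2}
    print(classify_impact(affected, total_sectors=len(capacity)).value)
    print(exposed_aircraft(affected, capacity))
```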

4.2.3 Impact on ATC operations

The impact of equipment failures on Air Traffic Control (ATC) service provision should

incorporate effects from an operational, safety, and financial perspective. In terms of

ATC operation, equipment failures could result in an inadequate ATC service, leading

for example to unexpected or increased delays in service provision (aircraft performing

holding procedures due to a failure of the Instrument Landing System (ILS) during the

landing phase of flight), delayed arrivals/departures, and limitations in capacity due to

traffic flow restrictions or stopped departures/arrivals.

From the safety perspective, failures generate unavailability of certain ATC functions.

They also generate increased workload as a result of unexpected and highly stressful

failure occurrences, increasing the potential for incident/accident occurrence. Vitally,

safety could be jeopardised by any type of equipment data integrity issue, when the

equipment provides timely but inaccurate information. On such occasions, an

equipment failure could go undetected for some time (see the example discussed in

section 4.2.1). All of these, combined with inadequate or insufficient training, the


absence of recovery procedures, and a lack of experience may create the potential for

controller error.

From a financial perspective, equipment failures create planned and unplanned costs

of repair, training (of both controllers and technicians), and incident investigation.

However, the most likely costs are measured in terms of additional costs placed on

airlines in the case of significant delays (e.g. loss of connecting flights and passenger

accommodation). These are discussed further in the next section.

Ideally the combination of all three consequences of an equipment failure should

constitute the overall impact on ATC operations or the particular failure’s ‘severity’.

However, in the operational environment the most usual practice is to combine safety

and the operational impact of an equipment failure to determine its severity rating. The

following paragraphs review severity ratings defined specifically for equipment failure

occurrences. They originate from safety regulations defined in two Air Navigation

Service Providers (ANSPs) and one Civil Aviation Authority (CAA).

The UK National Air Traffic Services (NATS) recognises four categories of failure types

based on their impact on ATC operations, namely major impact, impact on workstation

or suite, ATC impact, and minimal impact (Table 4-2). Furthermore, analysis of

operational failure reports in this thesis identified the severity categorisation from one

CAA (referred to as Country C) and another ANSP (referred to as Country D). The CAA

of Country C defines the severity rating of equipment failures according to the potential

to cause a significant problem (see Table 4-3).

Table 4-2 UK NATS severity rating (from NATS, 2002)

| Severity | Definition |
|---|---|
| Major impact to Ops room | Severe flow restrictions could be required |
| Impact to workstation/suite | May be necessary to combine/move positions immediately or sector flow restrictions may be required |
| ATC impact | Not immediately critical, will have greater operational impact over time |
| Minimal impact | Centre management required |

Table 4-3 Country C’s severity rating as defined by its CAA

| Severity | Factor | Definition |
|---|---|---|
| CR | Critical | An occurrence or deficiency that caused, or on its own had the potential to cause, loss of life or limb. |
| MA | Major | An occurrence or deficiency involving a major ATC system component that caused, or had the potential to cause, significant problems to the function or effectiveness of that system. |
| MI | Minor | An isolated occurrence or deficiency not indicative of a significant ATC system problem. |

Finally, the data for Country D originate from one particular ATC Centre. This Centre

determines the severity of an incident as a result of the combination of the impact it has

on both the controllers (internally in this ATC Centre as well as externally in other ATC

units) and system control and monitoring engineers. In general, in this particular ATC

Centre the determination of the severity of an incident is the task of the system control

and monitoring unit, which distinguishes five severity classes. These are presented in

Table 4-4.

Table 4-4 Country D severity rating as defined by the particular ATC Centre

| Severity | Factor | Definition |
|---|---|---|
| 1 | System down | A system outage affecting the total of ATC services provided |
| 2 | Critical | An error severely affecting a single or few random working positions or a single external service or an error on a “first” standby system. |
| 3 | Urgent | An error affecting part of a single or few random working positions or part of an external service or an error on a backup system reducing backup capacity. |
| 4 | Important | An error affecting a supportive service or a system for which automatic backup is available. |
| 5 | Enhancement | An error having no direct operational impact and only slight non-operational impact. |

These severity rating schemes indicate that each country follows its own severity index.

Furthermore, there is a difference in severity ratings between ANSPs and CAAs, as

ANSPs are concerned about the impact on their service provision business (e.g.

delays), whilst safety regulators are concerned about whether such an event causes an

accident. Therefore, simply comparing the severity of occurrences between countries is

unlikely to produce useful findings. All classifications are rather qualitative and depend


upon experience and judgement, which always involves a degree of subjectivity. As a

result, it is necessary to define a single severity classification for the entire dataset

available in this study corresponding to the existing equipment failure severity ratings

(UK NATS, Country C, and Country D). Consistent with operational practice, the

severity rating defined in the following paragraphs combines safety and operational

impact of equipment failures, while disregarding the financial aspect due to lack of

data. Since the focus of this thesis is on the impact of equipment failures on ATC

operations (including its impact on controller performance), the exclusion of the

financial aspect of severity rating does not have a detrimental effect on this severity

rating and the subsequent quality of data analyses.

The result is a three-level severity rating (major, moderate, and minimal) of equipment

failures based on their impact on ATC operations, as would be appreciated by the

controller (Table 4-5). It is important to highlight that this severity categorisation is

based on the exposure of an ATC Centre to the failed equipment (affecting the entire

ATC Centre, a number of workstations, or only the backup system) regardless of the

type of service provided by the affected ATC Centre. The significant difference in the

level of detail in the reports and the overall need for a consistent approach led to the

exclusion of the type of ATC service in the overall severity categorisation. This

characteristic is accounted for later on in the thesis through the assessment of the

recovery context surrounding an equipment failure occurrence. As a result, this

exclusion here does not have a detrimental effect on the severity rating and the

subsequent quality of data analyses. In general, the severity rating is based on the

failure type, available contextual conditions of the failure occurrence, and its impact on

ATC operations.

Table 4-5 Severity rating defined in this research and mapped with available sources

| Severity rating in this research | Definition of the severity rating in this research | Mapping with severity ratings from available research |
|---|---|---|
| Major | Definition: This type of failure may cause severe disruptions on every workstation. It may require immediate traffic flow restrictions to contain workload to manageable levels, which are safe for sustained ongoing operations. Examples: loss of main Flight Data Processing System (FDPS), total voice communication outage, loss of Multiple Radar Processing (MRP), loss of Terminal Approach Radar (TAR), loss of Parallel Approach Runway Monitor (PARM), loss of radar coverage, either complete or over larger parts (Primary Surveillance Radar - PSR and secondary surveillance radar - SSR), total power failure, loss of all Radio Telephony (RT) frequencies, incorrect barometer indication (as part of meteorological equipment), Instrument Landing System (ILS) failure during approach phase and in the reduced visibility conditions, failure of runway/taxiway lights in reduced visibility conditions, wrong indication of runway/taxiway lights, Surface Movement Radar (SMR) failure or provision of wrong label indication. | Major (UK NATS); Major (Country C); 1 (Country D) |
| Moderate | Definition: Only affects workstations reliant on the failed item or service. The disruption of ATC operation is contained and a normal level of operation may be resumed by physically moving and combining the role of the affected workstations with another within the sector suite or by physically moving the sector team to the stand-by suite. Under some conditions, sector flow restrictions may be applied. Examples: loss of single sector frequency, loss of a number of frequencies, loss of one or two workstations in a sector suite, loss of entire sector suite, loss of telephone panel or Voice Switching And Communication System (VSCS) on a single workstation, loss of one radar (in multiple radar environment), loss of ground-based navigational aids (e.g. Very high frequency Omnidirectional Range - VOR, Non-Directional Beacon - NDB, Distance Measuring Equipment - DME), loss of PSR (as it is a backup to SSR), SSR garbling, loss of safety nets (as these are only tools to support controller). | Impact on workstation/suite (UK NATS); Major (Country C); 2 and 3 (Country D) |
| Minimal | Definition: Initial disruption to ATC operations is not immediately critical, but could have greater impact over time (If not recovered within a reasonable time frame, disruptions to ATC operations may be prolonged/sustained). This escalation with time can restrict traffic flow into sector(s). Examples: loss of processor, loss of link, loss of system control and monitoring unit, loss of headset, ILS failure during approach in normal visibility conditions because the opportunity for go-around always exists, failure of runway/taxiway lights (in normal visibility conditions) as this system is only a visual aid to the instrument landing, failure in communication link to adjacent ATC Centre, loss of auxiliary display, temporary failure of strip printer or paper jam, inadequate strength of RT frequency, failure of left hand headset connector while right hand is functioning, disturbance/interference on a ground frequency, loss of sequencing tool, and loss of pointing/input devices. | ATC and minimal impact (UK NATS); Minor (Country C); 4 and 5 (Country D) |

Having defined the three-level severity rating to be used in this research, appropriate

mapping is established with the existing severity ratings (as defined by UK NATS, the

CAA of Country C, and the ANSP of Country D). The comparison of specific categories

from each of the available sources reveals the matching with ‘major’, ‘moderate’, and

‘minimal’ ratings as defined in this research (Table 4-5). Note however that the ‘major’

category, as defined by Country C, had to be split between ‘major’ and ‘moderate’

categories, as defined in this research. The rationale behind this split is based on two


criteria of equal importance. The first criterion is the definition of ‘major’ and ‘moderate’

categories as presented in Table 4-5. In other words, the severity rating has to

distinguish between failures that affect the entire ATC Centre and those that affect only

workstations reliant on the failed item. The second criterion is based on the impact of a

failure on ATC operations. For example, loss of a VOR or NDB is rated as ‘moderate’

because navigation may still be provided using radar surveillance or other navigational

aids (e.g. Global Positioning System - GPS, Automatic Dependent Surveillance - ADS).

However, loss of an ILS during the approach phase and in reduced visibility conditions is

rated as ‘major’. During this phase of flight the aircraft is in the landing configuration

(i.e. reduced speed, in close proximity to the ground). If visual contact with the ground is

not achieved at the moment of the failure, an immediate go-around procedure is

necessary. Because of this, the failure of an approach navigation aid (such as ILS) is

considered more severe.
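
The mapping between the existing severity ratings and the three-level rating of this research can be expressed as a simple lookup. The sketch below is illustrative only: the direct entries follow Table 4-5 (using the category names of Tables 4-2 to 4-4), whereas the split of Country C's 'major' category is reduced here to a single yes/no question on whether the failure affects the entire ATC Centre. That reduction is an assumption, since the two criteria described above require case-by-case judgement. The 'critical' category of Country C does not appear in Table 4-5 and is therefore left unmapped. All identifiers are hypothetical.

```python
# Direct mappings taken from Table 4-5.
DIRECT_MAPPING = {
    ("UK NATS", "Major impact to Ops room"): "major",
    ("UK NATS", "Impact to workstation/suite"): "moderate",
    ("UK NATS", "ATC impact"): "minimal",
    ("UK NATS", "Minimal impact"): "minimal",
    ("Country C", "Minor"): "minimal",
    ("Country D", 1): "major",
    ("Country D", 2): "moderate",
    ("Country D", 3): "moderate",
    ("Country D", 4): "minimal",
    ("Country D", 5): "minimal",
}


def map_severity(source, category, affects_entire_centre=None):
    """Map an existing severity rating to 'major', 'moderate', or 'minimal'.

    affects_entire_centre is only needed for Country C's 'Major' category,
    which is split between 'major' and 'moderate' (ASSUMPTION: the split is
    reduced to this single flag for illustration).
    """
    if (source, category) == ("Country C", "Major"):
        if affects_entire_centre is None:
            raise ValueError("Country C 'Major' needs the exposure of the Centre")
        return "major" if affects_entire_centre else "moderate"
    try:
        return DIRECT_MAPPING[(source, category)]
    except KeyError:
        raise KeyError(f"no mapping defined for {source!r} / {category!r}") from None


if __name__ == "__main__":
    print(map_severity("Country D", 3))
    print(map_severity("Country C", "Major", affects_entire_centre=False))
```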

4.2.4 Impact on ATM operations

As noted earlier, it is highly beneficial to analyse the impact of the failures on

operations both inside the control room and outside over a given airspace. At the same

time, it is also important to recognise that failures could have an impact not only on

ATC but also on the wider ATM system. The following examples show how severe the

impact of an equipment failure on ATM operations can be.

According to Aviation Week (reported in RISKS, 2000; NATS, 2004), the UK ATC

service suffered a flight data processing software failure at West Drayton ATC Centre

in June 2000. As a result of the failure, flight progress strips had to be hand written,

which forced the ANSP to restrict the amount of traffic in UK airspace. While the ATC

system recovered after four hours, the effects of this failure were felt for several days

with knock-on effects as far as France and Germany. This is understandable due to the

centralised flow control of traffic in Europe (provided by the EUROCONTROL Central

Flow Management Unit). As a result of the failure’s severity and subsequent flow

control, its impact spread over a sub-continental region.

Another example of a failure with a severe impact on a wide region is the brief power

failure which affected the US Federal Aviation Administration (FAA) Southern California

Terminal Radar Approach Control (TRACON) facility at Miramar on April 19, 2006. The

facility switched immediately to backup power. The outage lasted only 6 or 7 seconds,

but had an impact on airports from the Mexican border and half way through the state

of California, due to imposed traffic flow control (10News, 2006).



Another example of the severe impact that one single failure can induce is the outage

that occurred in the Chicago ATC Centre in 1995 when the en-route automation

component failed for two hours. This single occurrence cost the airlines an estimated

$12 million in delays (National Transportation Library, 1997). The National

Transportation Library (NTL) report mentions this example to make a case for the

replacement of the outdated main and backup Flight Data Processing Systems

(FDPS), involved in the reported incident. In short, these examples show how severe

the impact of an equipment failure on global ATM operations can be. This issue will

become especially important in a future gate-to-gate ATM system where the roles for

planning and control will have to be re-organised and distributed between controllers

and pilots.

Similar to ATC operations, the impact of failure on ATM can be analysed from several

different perspectives. From operational and safety perspectives, a higher degree of

workload will be experienced both on the ground by controllers, technicians, and

engineers and in the air by flight crew. From a financial perspective, in addition to costs

identified in ATC, it is necessary to add the cost of delays in a wider region. A small

exercise has been conducted on the cost of delays induced by ATC equipment failures

to indicate the financial impact of delays in the European Civil Aviation Conference

(ECAC) and US airspace. This is presented in Appendix I.

Having discussed the consequences of equipment failures, it is important to discuss

how such consequences could be prevented or mitigated. This involves the process of

recovery from equipment failure and a distinction can be made between technical and

human recovery. The following section focuses on technical recovery and the principles

used to prevent and in some cases to mitigate the impact of equipment failures. The

human recovery aspects are addressed in Chapter 5 and throughout the rest of the

thesis.

4.3 Definition of technical defences (technical recovery)

The aim of any design is to identify the functions of a system in advance and to

develop a method which assures the delivery of the intended functions. It is always

necessary to predict what may happen if something fails or if an operator handles a

system incorrectly. Experience shows that even the best designed systems fail

occasionally. Therefore, it is crucial that every design concept includes a solution to re-

establish system operation and provide continuous service. These solutions are



grouped under the term ‘technical built-in defences’. They represent defences against

any unplanned or unwanted interruption of service. They are complex socio-technical

systems which combine technical, human, and organisational measures that prevent or

protect against an adverse effect (Smith et al., 2004). Verification of the existence and

appropriateness of existing defences provides confidence in the safety of a system and

is a requirement for system certification.

Safety is recognised as the ultimate imperative in ATC and therefore, should be

addressed as early as possible in the design process. Having sound safety principles

built into each phase of the design (i.e. conceptual, preliminary, and detailed design

phase) is a useful way to avoid, prevent, and mitigate failures and their impact. Safety

through design is planned through five different principles (Figure 4-1) for hazard¹

avoidance, elimination, or control, which are as follows (Christensen and Manuele,

1999; National Aeronautics and Space Administration, 2002; The European New

Machinery Directives cited in Piantek, 1999):

• Eliminate hazards;

• Design for minimum risk;

• Incorporate safety devices (i.e. devices designed to prevent any unwanted event);

• Provide warning devices (i.e. alert that signals the occurrence of some unwanted

event); and

• Develop operating procedures and training schemes.

Figure 4-1 Safety through design (adapted from Christensen and Manuele, 1999)

1 Within system safety, a hazard is usually defined as a condition which can lead to an accident.

In this research, a hazard is defined as the ATC system state resulting from an equipment failure that penetrates all existing technical defences and affects the ability of the controller to perform his/her tasks.



The suggested principles follow the logical order of precedence. The first two

approaches focus on the elimination of the hazard from the system. However, if the

identified hazards cannot be eliminated (due to difficulties or cost), risk should be

reduced by using fixed, automatic, or other protective safety devices (i.e. defences for

seamless recovery from failure). When neither design nor safety devices can effectively

eliminate identified risks or adequately reduce them, devices should be used that

detect the unwanted condition and produce adequate warning signals to alert the

controller (i.e. defences for transmitting information regarding a failure). These warning

signals should be designed to minimise the probability of inappropriate human reaction

and response. Note that regardless of how a warning device performs (Figure 4-2), the

triggering failure represents a hazard (according to the definition in this thesis) as it

affects controller performance.

As explained before, the human operator remains the last line of defence (i.e. human

recovery). For this reason, when warning devices are not sufficient, special procedures

and training schemes should be designed. These must be periodically tested, verified,

and regularly updated to assure their effectiveness.

Similarly, when dealing with equipment failures in ATC, it is important to distinguish

between technical and human (i.e. controller) recovery (Figure 4-2). Both processes

start with the detection of failure (either by a technical system or controller) and

conclude with an outcome. The outcome can be nominal (pre-failure), non-nominal but

stable (i.e. degraded), or inadequate system state (leading to incident or accident). The

outcome of the equipment failure and recovery process is discussed in detail in the

following Chapter. The following paragraphs focus on technical recovery, while human

recovery is addressed in subsequent Chapters.

Figure 4-2 Technical and human recovery

As already highlighted, technical built-in defences can be divided into two different

categories according to the function they provide. These are defences for recovering

from failures (safety devices) and defences for transmitting relevant information on



failure (warning devices). Both categories are examined further in the following

sections.

4.3.1 Defences for recovering from failures (safety devices)

This group of technical built-in defences should include mechanisms designed to

prevent an unwanted event or safety devices (e.g. radiotelephony anti-blocking device,

availability of primary and secondary frequency, automatic switching from normal to

fallback operational mode, automatic switching from primary to secondary glide slope

transmitter) and the creation of fault-tolerant systems through redundancy/diversity. The

main objective of built-in defences is to prevent adverse events from happening (i.e.

preventive defences) or to lessen the impact of the consequences on operations (i.e.

mitigative or protective defences). If a failure has only a preventive barrier, there is no

fault tolerance in the system, as achieved by protective defences. For example, the

feasibility study of the EUROCONTROL eight states free route airspace concept was

established to ensure that free route airspace operations are as safe as the current

fixed route operations (EUROCONTROL, 2001c). The analysis identified 128

preventive defences but no protective defences. Therefore, this concept, in its current

state, fails to establish fault tolerance in the ATM system.

Fault-tolerant systems are designed to preserve the minimum required service in spite

of failure occurrence. This is achieved through the employment of redundancy.

Redundancy is the ability of a system to keep functioning normally in the event of an

equipment failure, by having backup components that perform duplicate functions

(Mauri, 2000). The goal of this process is to mask failure events from the controller, but

also to capture and report them for the necessary maintenance. However, redundancy

itself is not always a solution due to common cause failures (e.g. fire or power outage).

Common cause failures are multiple failures that arise from a single shared cause. In order to prevent the occurrence

of these types of failures emphasis is placed on diversity of the systems (i.e. different

manufacturers), equipment diversity in manufacturing (e.g. different software

packages), and/or functional diversity (e.g. physically independent components,

redundant hydraulic system lines of commercial aircraft are physically separated so

that fire in a certain compartment does not affect all the lines simultaneously).

4.3.2 Defences for transmitting information on failure (warning devices)

Alerts should be provided to the controller in the event of a critical change in the ATC

system or equipment status and to remind him/her of critical actions that must be taken. An



alert or a warning should enhance the probability of appropriate human reaction and

response (i.e. controller recovery performance). According to the FAA’s Human Factors

Design Standard (Federal Aviation Administration, 2003) warning devices should:

• Alert the operator to the fact that a problem exists;

• Inform the operator of the nature of the problem;

• Guide the operator’s initial responses (based on priority); and

• Confirm in a timely manner whether the operator’s response corrected the problem.

Alerts are usually generated immediately after the system detects any discrepancy

from predefined system performance. There are several ways in which ATC controllers

are informed of equipment failures or non-availability of certain functions. The most

usual ones are through colour-coding (e.g. change in the workstation’s border colour)

and textual messages, all presented on the Human Machine Interface (HMI). In

addition to the content and location of the alert message, it is equally important to

display an alert in a timely manner. Alert onset is defined as the time between a system’s

detection of a failure and the moment an alert is presented on the HMI either by colour

change or text message (i.e. time-to-alert or TTA). This timing is usually system-driven

(based on the system threshold) but there are novel initiatives toward human-driven or

cognitively-driven alert onset. In general there are three different types of alert onset:

• Immediate onset (an alert is presented on the HMI after the system detects the

failure with the least time delay). This is the normal case for severe events.

• Delayed onset (an alert is presented on the HMI with a time-based or threshold-based

onset). For example, system requirements could be set up to inject an

alert with a specific time delay following the occurrence of a failure or to inject an

alert once a system-defined threshold has been reached (i.e. TTA). In the nuclear

industry this is known as alert sequencing or alert hierarchies indicating the

urgency of actions needed. In this way, a hierarchy makes use of safety criticality,

injecting firstly safety-relevant alerts followed by operational alerts. In satellite

navigation, the TTA value is one of the measures of the integrity of a satellite

navigation system (Feng et al., 2005).

• Cognitively convenient onset (an alert is presented on the HMI based on

cognitive convenience which can be defined through the levels of controller

workload). This futuristic concept is mostly used in the nuclear and automobile

industry where cognitive convenience is determined by measuring workload

using physiological measures (e.g. heart rate, breathing rate, galvanic skin

response, eye tracking device). This concept has been tested on a US naval ship

as described in Daniels, Regli, and Franke (2002). This study proposes a method



to control the cognitive effects of task interruption by influencing the timing of an

alert and helping a user to regain their situational awareness within the

interrupted task.
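The three onset types listed above can be contrasted in a minimal sketch. This is an illustration only and not part of any ATC system: the class and function names, the delay value, and the normalised workload estimate (0 to 1) are all hypothetical, and delayed onset is shown in its time-based form only.

```python
from dataclasses import dataclass
from enum import Enum


class Onset(Enum):
    IMMEDIATE = "immediate"
    DELAYED = "delayed"
    COGNITIVE = "cognitively convenient"


@dataclass(frozen=True)
class AlertPolicy:
    onset: Onset
    delay_s: float = 0.0         # hypothetical time-based delay for DELAYED onset
    workload_limit: float = 1.0  # hypothetical workload ceiling for COGNITIVE onset


def should_present(policy: AlertPolicy, elapsed_s: float, workload: float) -> bool:
    """Decide whether an alert is shown `elapsed_s` seconds after failure detection."""
    if policy.onset is Onset.IMMEDIATE:
        return True
    if policy.onset is Onset.DELAYED:
        return elapsed_s >= policy.delay_s
    # COGNITIVE: wait until the assumed workload estimate (0..1) is low enough.
    return workload <= policy.workload_limit


# Hypothetical usage: a delayed alert with a 5-second onset.
delayed = AlertPolicy(Onset.DELAYED, delay_s=5.0)
print(should_present(delayed, elapsed_s=3.0, workload=0.2))
print(should_present(delayed, elapsed_s=5.0, workload=0.2))
```

A threshold-based delayed onset could be sketched in the same way by comparing a monitored system parameter, instead of elapsed time, against its limit.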

After a detailed overview of the equipment failure characteristics as well as technical

recovery, the next section analyses the nature of equipment failures that manage to

penetrate the existing built-in defences and affect controller performance. For this

purpose, findings from existing literature have been augmented by results of the

analysis of more than ten thousand operational failure reports originating from four

different countries. This sample of equipment failure reports has already been

introduced in Chapter 3 and the following section further analyses this sample.

4.4 Analyses of operational failure reports

Existing literature on equipment failure characteristics has been reviewed in the

previous sections of this Chapter. This has been further augmented and informed by

the analyses of operational data from four countries (i.e. Countries A, B, C, and D), as

presented in detail in Chapter 3.

4.4.1 Data analysis methodology

Since the four countries are of different airspace size, equipage, traffic demand, and

density in their airspace, simple analysis of equipment failure rate would be of limited

value. Therefore, to gain a common metric to assess distribution of equipment failures

per year and per data source, it is necessary to normalise the rates of equipment

failures per appropriate unit of measurement. For example, the rates per ATC Centre

enable comparison of ATC Centres of similar traffic demands and thus equipage, but

otherwise fail to provide a meaningful performance measure. Similarly, the rate of radio

frequency failure per sector or per total number of available frequencies in a sector

(usually there are primary and secondary frequencies available in a sector) enables a

metric for the availability of voice communication in each sector. However, this unit is

not of practical use as the number of sectors changes hourly based upon changes in

air traffic demands. As a result, the rate of equipment failures per flight hours is used in

this research². This approach avoids difficulties and differences associated with the

2 Hours flown data are collected for commercial airlines, including domestic, regional, and

international air traffic for each country.



geographical coverage of the datasets available and the availability of ATC systems

and equipment (e.g. number of radars, navaids, communication systems).
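The normalisation described above amounts to a simple ratio. The following sketch shows the calculation under stated assumptions: the function name is illustrative, and the failure counts and flight hours are hypothetical values chosen for the example, not figures from the four datasets.

```python
def failure_rate(n_failures: int, flight_hours: float, per: int = 100_000) -> float:
    """Return the number of equipment failures per `per` flight hours."""
    if flight_hours <= 0:
        raise ValueError("flight_hours must be positive")
    return n_failures * per / flight_hours


# Hypothetical yearly inputs: year -> (reported failures, flight hours flown).
yearly = {2001: (120, 800_000), 2002: (90, 450_000)}

rates = {year: failure_rate(n, hours) for year, (n, hours) in yearly.items()}
for year, rate in sorted(rates.items()):
    print(f"{year}: {rate:.1f} failures per 100,000 flight hours")
```

Changing `per` to 10,000 gives the unit used for Country D later in this Chapter.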

The information on flight hours for each country has been extracted from the CAA

websites, annual incident summaries, and personal correspondence with the staff from

the engineering unit. After establishing the common ground with an appropriate unit of

measurement, further analyses are performed with available data structured around

four equipment failure characteristics, as these could be extracted consistently

from available datasets. These four equipment failure characteristics are: type of ATC

functionality and equipment affected, complexity, severity, and duration³ of equipment

failures. The type of equipment/ATC functionality affected and complexity of failure type

are extracted from the short summary available for each report. The severity of

equipment failure is extracted using the available severity rating (if it existed) or

assessing the available information of the operational and safety impact of equipment

failure and thus applying the severity rating derived in this research (see Table 4-5).

The duration variable was available only in the Country D database. Finally, additional

statistical tests have been performed to identify any relationship between the four

equipment failure characteristics. The structure of the data analyses is presented in

Figure 4-3.

The nature of the variables under consideration determined which statistical methods

could be used to analyse the data. As can be seen from their description in this

Chapter, most variables are categorical (type of equipment/ATC functionality affected,

complexity of failure type, and severity). Additionally, complexity of failure type and

severity variable have an ordinal character (assuming the ranking between possible

categories). Only duration represents a continuous or ratio scale variable⁴. This

variable is first investigated for its overall distribution and then split into categories

to extract information regarding failures of short duration (discussed in sections 4.1.4

and 4.4.6).

3 The duration characteristic is analysed last as it is available only in one database.

4 Variables can be either continuous or categorical. Continuous variables are numeric values on

an interval or ratio scale (e.g. age, income). Categorical variables can be either nominal or ordinal. Nominal variables differentiate between categories but do not assume any ranking between them (e.g. gender). On the other hand, ordinal variables differentiate between categories that can be rank-ordered (e.g. from lowest to highest).



[Figure 4-3: flow diagram of the analysis. Recoverable elements: operational failure reports (4 countries, 22,808 available reports); data pre-processing; rate of equipment failures (traffic figures from respective CAAs); type of ATC function and equipment affected (ATC functional classification, Chapter 2); complexity of failure type (Countries A, B, and C; Chapter 4, section 4.1.2); severity (severity rating, Chapter 4, Table 4-5); duration (Country D database); additional statistical tests.]

Figure 4-3 Operational failure reports analyses

Using the SPSS statistical package, frequencies of related categories are identified and

the most frequent categories are reported for each variable. To establish relationships

between these variables, additional statistical tests are also performed. In this regard,

chi-square tests are used to test the relationships between two categorical variables.

The most important assumptions of the chi-squared statistical tests are random sample

data, a large sample size, adequate cell sizes (no less than 5 observations per cell),

independent observations, and normal distribution of deviations between observed and

expected values. The size and characteristics of the available datasets imply the

conformance with all listed assumptions. Furthermore, the Cramer’s V test is used to

measure the association for nominal data (i.e. ATC functionality variable) whilst the

Kendall tau test is used for ordinal data (i.e. severity and duration variables). These

tests are briefly discussed in the following paragraphs.
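Before turning to those measures, the cell-size assumption listed above can be checked with a short sketch. It is an illustration only: the thesis analysis was carried out in SPSS, the function names are hypothetical, the counts are invented, and the minimum of five is applied here to expected counts, which is the usual reading of that rule and is an assumption about how it was applied.

```python
def expected_counts(table):
    """Expected cell counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def cells_below(table, minimum=5):
    """Return (row, column) indices whose expected count is below `minimum`."""
    expected = expected_counts(table)
    return [
        (i, j)
        for i, row in enumerate(expected)
        for j, value in enumerate(row)
        if value < minimum
    ]


# Hypothetical counts: rows = severity (major, moderate, minimal),
# columns = two ATC functionalities.
observed = [[4, 2], [40, 30], [60, 64]]
print(cells_below(observed))
```

Any cell returned by `cells_below` would suggest merging categories or treating the chi-square result with caution.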



Cramer’s V is the chi-square-based test that measures the strength of the relationship

between nominal variables and is applicable across contingency tables of size greater

than 2×2 (Berenson et al., 2006). Cramer’s V coefficient is interpreted as a measure of

the relative strength of an association between two variables and it ranges from 0 to 1

(i.e. 1 representing a strong association). Suppose that the null hypothesis is that two

variables are independent random variables. Based on the frequency table and the null

hypothesis, the chi-squared statistic X² can be computed as the sum, over all cells, of the

squared difference between the observed (O) and expected (E) frequency, divided by the

expected frequency. Then, Cramer’s V coefficient is defined in equation 4-1 below:

$$V = \sqrt{\frac{X^2}{n \times m}} = \sqrt{\frac{\sum \frac{(O - E)^2}{E}}{n \times m}} \tag{4-1}$$

where n represents the sample size and m represents the smaller of two values: the number

of rows minus one and the number of columns minus one.
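As a minimal sketch of equation 4-1, the coefficient can be computed directly from a table of observed frequencies. The function name and the example tables are hypothetical; the thesis results themselves were obtained with SPSS.

```python
import math


def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Chi-squared statistic: sum over all cells of (O - E)^2 / E.
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected

    # m is the smaller of (rows - 1) and (columns - 1).
    m = min(len(row_totals) - 1, len(col_totals) - 1)
    return math.sqrt(x2 / (n * m))


# Hypothetical 3x2 table of failure counts.
print(round(cramers_v([[12, 8], [30, 45], [20, 60]]), 3))
```

A table in which the two variables are perfectly associated yields 1, and a table whose rows are proportional to one another yields 0.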

Kendall’s tau is a rank-based test that measures the strength of the relationship

between ordinal variables and is applicable across contingency tables of all sizes (Berenson

et al., 2006). Kendall’s tau coefficient has the following properties:

• If the agreement between the two rankings is perfect (i.e. the two rankings are the

same) the coefficient takes the value of 1.

• If the disagreement between the two rankings is perfect (i.e. one ranking is the

reverse of the other) the coefficient takes the value of -1.

• For all other associations the value lies between -1 and 1, and increasing values

imply increasing agreement between the rankings. If the rankings are completely

independent, the coefficient takes the value of 0.

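Kendall tau coefficient is defined in equation 4-2 below:

$$\tau = \frac{2P}{\frac{1}{2} n (n - 1)} - 1 = \frac{4P}{n (n - 1)} - 1 \tag{4-2}$$

where n represents the number of observations, so that n(n-1)/2 is the total number of pairs, and P represents the number of concordant pairs.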

In statistics, a concordant pair is a pair of a two-variable observation dataset {X1,Y1}

and {X2,Y2}, where (equation 4-3):

$$\operatorname{sgn}(X_2 - X_1) = \operatorname{sgn}(Y_2 - Y_1) \tag{4-3}$$



Correspondingly, a discordant pair is a pair where (equation 4-4):

$$\operatorname{sgn}(X_2 - X_1) = -\operatorname{sgn}(Y_2 - Y_1) \tag{4-4}$$

Sgn represents the sign function defined as (equation 4-5):

$$\operatorname{sgn}(x) = \begin{cases} -1, & x < 0 \\ 0, & x = 0 \\ 1, & x > 0 \end{cases} \tag{4-5}$$

Therefore, a high value of P indicates that most pairs are concordant, i.e. the rankings

are consistent. A tied pair (sgn x = 0) is not regarded as concordant or discordant. If

there is a large number of ties, the total number of pairs (in the denominator of the

equation 4-2) should be adjusted accordingly (Berenson et al., 2006).
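The definitions in equations 4-2 to 4-5 can be combined in a short sketch that counts concordant pairs and applies equation 4-2. It is illustrative only: the function names and the example rankings are hypothetical, tied pairs are simply not counted as concordant, and no tie adjustment of the denominator is attempted, so the result is exact only for data without ties.

```python
from itertools import combinations


def sgn(x):
    """Sign function of equation 4-5: -1, 0, or 1."""
    return (x > 0) - (x < 0)


def kendall_tau(xs, ys):
    """Kendall tau via equation 4-2: tau = 4P / (n(n-1)) - 1, with no tie adjustment."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length of at least 2")
    # P counts concordant pairs (equation 4-3); tied pairs are excluded.
    p = sum(
        1
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if sgn(x2 - x1) == sgn(y2 - y1) != 0
    )
    return 4 * p / (n * (n - 1)) - 1


# Hypothetical ordinal codes for two variables observed on four reports.
print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))
```

Identical rankings give the upper bound of the coefficient and fully reversed rankings give the lower bound, in line with the properties listed above.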

After presenting the overall methodology used for data analyses, the following sections

present some of the key findings and results.

4.4.2 Rate of equipment failures

From Figure 4-4, the rate of equipment failures for Country A initially increases greatly

before peaking in 2002, followed by a sharp drop in 2003. This corresponds to a large

number of early failures experienced with the opening of the new ATC Centre which

accounted for 63.4 percent of all reported equipment failures in that year. Country B’s

rate rises from 17.5 failures per 100,000 flight hours in 2001 to 25 failures per 100,000

flight hours in 2002. This is followed by a drop to 17.8 failures per 100,000 flight hours

in 2003 before increasing sharply in 2005. The reason for high rates in 2004/2005 is

that the air navigation service provider directed controllers to be more diligent about

filling out incident reports to improve the quality of the incident database and the overall

safety management system. Country C’s rate exhibits a steady trend for the entire

period of 13 years, being on average nine failures per 100,000 flight hours.



[Figure 4-4: chart. Horizontal axis: year (1992 to 2005); vertical axis: rate per 100,000 flight hours (0 to 50); series: Country A, Country B, Country C.]

Figure 4-4 Total number of equipment failures per flight hours flown in each year for countries A, B, and C

The data available on the rate of equipment failures for Country D reveal a sharp rise

in the number of equipment failures from 30 failures per 10,000 flight hours captured in the

last half of the year 2000 to 45 failures per 10,000 flight hours in 2001 (Figure 4-5)⁵.

The reason for this is that only five months of data were available for the year 2000.

Therefore, it can be concluded that the rate of reported equipment failures in this ATC

Centre decreases in absolute numbers.

[Figure 4-5: chart. Horizontal axis: year (1992 to 2005); vertical axis: rate per 10,000 flight hours (0 to 50); series: Country D.]

Figure 4-5 Total number of equipment failures per flight hours flown in each year for country D (year 2000 incomplete)

5 Although the rates of equipment failure of Country D are tenfold higher compared to Countries

A, B, and C, Country D data are retained for subsequent analyses as they represent the most detailed and reliable source of operational failure reports.



The next section builds on this trend analysis and assesses affected ATC

functionalities. The classification of all ATC functionalities, as defined in Chapter 2, has

been used for this purpose and the findings are presented for each Country separately.

4.4.3 Type of ATC functionality and equipment affected

This section provides the analysis of ATC functionalities and their sub-functions

affected by equipment failure occurrences as reported for Countries A, B, C, and D.

Country A data shows that the two ATC functionalities most affected are the

communication and surveillance functions (Figure 4-6).

Figure 4-6 Most affected ATC functionality (Country A)

Further analysis of sub-functions and equipment most affected by failures identified the

following five types: air ground communication, secondary surveillance radar (SSR),

flight data processing system (FDPS), primary surveillance radar (PSR), and other

communication systems, ranging from pagers, headsets, microphones, cables, to

footswitches (Table 4-6).

Table 4-6 Most affected ATC equipment (Country A)

ATC equipment affected Percentage

air ground communication 33.1

secondary surveillance radar (SSR) 17.7

flight data processing system (FDPS) 10.1

primary surveillance radar (PSR) 5.2

other communication systems 4

Similar to the previous case, two ATC functionalities for Country B most affected by

equipment failures are the communication and surveillance functions (Figure 4-7).



Figure 4-7 Most affected ATC functionality (Country B)

Table 4-7 presents the types of equipment most affected by failures (six are listed, as the last two are tied). These are: PSR,

air situational display or radar display, air ground communication, voice switching

communication system (VSCS), data exchange network, and runway/taxiway lighting.

Table 4-7 Most affected ATC equipment (Country B)


primary surveillance radar (PSR) 17.2

air situational display 15.1


voice switching communication system (VSCS) 8.8

data exchange network 7.6

runway/taxiway lighting 7.6

Country C shows a slightly different trend in the distribution of equipment failures per

ATC functionality. The two most affected categories are the navigation and

communication functions (Figure 4-8).

Figure 4-8 Most affected ATC functionality (Country C)



Furthermore, the five most affected equipment types are: air ground communication,

instrument landing system (ILS), very high frequency omnidirectional radio range

(VOR), non-directional beacon (NDB), and air situational display (Table 4-8).

Table 4-8 Most affected ATC equipment (Country C)



instrument landing system (ILS) 19.6

very high frequency omnidirectional radio range (VOR) 7.6

non-directional beacon 6.5


Country D shows a similar trend to Countries A and B, as the two most affected ATC

functionalities are communication and surveillance (Figure 4-9). Although the

navigation function appears not to be represented at all in Figure 4-9, there were in fact

two failures affecting this functionality, both due to testing of Global Positioning

System (GPS) clock alarms. The reason for the under-representation of this ATC

functionality is the fact that data originated from one particular ATC Centre that

provides area control service and as such is not responsible for the ground-based

navigational aids and airport-based equipment (e.g. meteorological equipment,

runway/taxiway lighting, ILS, Surface Movement Radar-SMR).

[Figure 4-9: chart. Horizontal axis: ATC functionality (Communication, Navigation, Surveillance, Data processing, Supporting, Safety nets, Power supply, Pointing/input, System monitoring); vertical axis: frequency (0 to 3,500).]

Figure 4-9 Most affected ATC functionality (Country D)

Further analysis of data for Country D shows that the following five equipment types

are most affected by equipment failures: air situational display (radar display), data

exchange network, air ground communication, other surveillance systems (mostly

referring to radar links), and other communication systems, such as pagers, headsets,

microphones, cables, and footswitches (Table 4-9).



Table 4-9 Most affected ATC equipment (Country D)



data exchange network 15.7


other surveillance systems 8.7

other communication systems 4

Table 4-10 collates the five ATC equipment types most affected by failures, from each

available dataset. Findings are structured according to the ATC functionality they

support (in rows) and sources (in columns). Overall it can be concluded that Countries

A, B, and D are quite similar in relation to the most affected ATC functionalities. Results

of data analyses from these three countries indicate that failures mostly affect the

communication and surveillance functionalities. On the other hand, results of data

analysis from Country C differ as failures mostly affect the navigation functionality.

These are mostly failures of ILS, followed by failures of VOR, NDB, DME, as well as

airport lighting facilities (runway and taxiway lighting). Furthermore, the only equipment

type frequently affected by failures in all four countries is air-ground communication.

Other equipment types common in available datasets are air situational display, radar,

data exchange network, and supporting communication system (e.g. pagers, headsets,

microphones, cables, and footswitches).

Table 4-10 Summary of the five ATC equipment types most affected by failures

| ATC functionalities | Country A | Country B | Country C | Country D |
|---|---|---|---|---|
| Communication | A/G communication; other communication systems | A/G communication; VSCS; data exchange network | A/G communication | A/G communication; other communication systems; data exchange network |
| Surveillance | PSR; SSR | PSR; air situational display | air situational display | other surveillance systems; air situational display |
| Data processing and distribution | FDPS | | | |
| Navigation | | runway/taxiway lighting | ILS; VOR; NDB | |



4.4.4 Complexity of failure type

As discussed previously in section 4.1.2 failures can affect single or multiple

components at the same time. The analysis of complexity of failure type was based on

extraction of the number of failures reported in each occurrence report, i.e. single or

multiple failures. It is assumed that failures that affect multiple components, regardless

of whether they are dependent or independent, were reported in the same operational

failure report. The personal correspondence with CAA staff in charge of the occurrence

databases from Countries A and B confirmed this assumption. According to them, if

two different items of equipment fail, but the time between failures is such that the

failure of one does not contribute to the failure of the other, then two 'single' failures are

reported separately. However, if the failures occur close together such that the failure

of one could have impacted on the failure of the other or, if unrelated, the fact that two

items failed close together meant that the controller workload is significantly increased,

then ‘multiple’ failures are reported in the same occurrence report. Based on these

findings, it was necessary to capture the frequency of reports that mentioned more than

one equipment failure. This was done consistently for the Country A and B datasets.

The Country C dataset had to be assessed separately due to the specifics of its reporting

system. In Country C, the database holds multiple records for each occurrence, because

each finding and cause is reported separately. As a result, the assessment

of the multiple failure occurrences had to be performed by assessing each individual

case and excluding all non-equipment failure reports. The Country D

dataset had to be excluded completely, as the reporting system of the system control and

monitoring unit accounts for each failure independently. Table 4-11 presents the

percentage of multiple failures amongst the available operational failure reports.

Table 4-11 Percentage of the multiple failure occurrences reported in the available datasets

Country | Number of reports with multiple failure occurrences | Total number of reports | Comment
A | 42 | 1378 |
B | 206 | 1393 |
C | 24 | 448 | separate assessment due to the specific reporting system
D | N/A | N/A | not applicable due to the specific reporting system
Aggregated data | 272 (8.4%) | 3219 |
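The aggregated percentage in Table 4-11 can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts printed in the table; the dictionary layout and the function name `share_multiple` are illustrative choices, not part of the original analysis.

```python
# Report counts copied from Table 4-11. Country D is left out because its
# reporting system logs each failure independently (marked N/A in the table).
reports = {
    "A": {"multiple": 42, "total": 1378},
    "B": {"multiple": 206, "total": 1393},
    "C": {"multiple": 24, "total": 448},
}


def share_multiple(multiple: int, total: int) -> float:
    """Return the percentage of reports that mention more than one failure."""
    return 100.0 * multiple / total


# Percentage of multiple failure reports for each country.
per_country = {c: share_multiple(**v) for c, v in reports.items()}

# Aggregated counts and percentage over Countries A, B, and C.
agg_multiple = sum(v["multiple"] for v in reports.values())
agg_total = sum(v["total"] for v in reports.values())
agg_share = share_multiple(agg_multiple, agg_total)

for country, pct in per_country.items():
    print(f"Country {country}: {pct:.1f}%")
print(f"Aggregated: {agg_multiple}/{agg_total} = {agg_share:.1f}%")
```

The aggregated line reproduces the 272 of 3219 reports (8.4 percent) given in the table, and the per-country figures show that Country B contributes most of the multiple failure reports.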



Using the severity categorisation defined in section 4.2.3, it is possible to categorise all

available equipment failure reports from operational and safety perspectives. The

following section assesses the ATC functionalities affected by equipment failure with

respect to their severity or impact on ATC operations.

4.4.5 Severity of equipment failures

Figure 4-10 presents the distribution of equipment failures according to the severity of

their impact on ATC operations. As discussed previously, three severity ratings are

recognised, namely major, moderate, and minimal (Table 4-5). Although major failures

are the least frequent, their impacts on ATC operations and controller recovery

performance are the most severe. For this reason, the rest of the analysis focuses on

‘major’ equipment failures. The distribution of the ATC functionalities most affected by

major failures may be skewed due to the Country D dataset which does not incorporate

failures of the navigation functionality (see section 4.4.3). Future research should

address the ‘moderate’ and ‘minimal’ severity categories, as these are prone to controller

recovery errors in the absence of written and practised procedures.

Figure 4-10 Distribution of equipment failures according to their severity

The ‘major’ category accounts for 7 percent, 14.4 percent, 12.7 percent and 6.5

percent of the equipment failures within Countries A, B, C, and D respectively. These

results show the importance of assessing the degree of severity for each of the

equipment failure occurrences. For example, the majority of failures reported in the



Country D dataset tend to have minimal impact on ATC operations and controller

performance (Figure 4-10). However, if we observe only major equipment failures, or

failures that affect an entire ATC Centre or a major part of it, it is notable that the most

affected ATC functionalities are: communication accounting for 45.3 percent of all

aggregated equipment failure reports, surveillance accounting for 29 percent, followed

by data processing and distribution accounting for 15 percent (Figure 4-11).

[Bar chart: frequency (0 to 250) of major equipment failures per ATC functionality (Comm, Nav, Surv, Data proc, Power, Pointing/input, System mon), broken down by Country A, B, C, and D]

Figure 4-11 Distribution of major equipment failures according to ATC functionality

Further, the major failures of the communication functionality are mostly due to the loss

of air ground communication or available frequencies and problems with data

exchange network (when used as a coordination channel). This is determined by

observing the frequency of equipment types that support the communication

functionality affected by a major failure. Using a similar approach, the frequency of

equipment types that support the surveillance functionality affected by a major failure is

determined. These are: air situational display and radar. Within the data processing

and distribution function, more than half of the major failures are due to one particular

piece of equipment, namely the Flight Data Processing System (FDPS). This particular

system handles flight plans, making them ‘live’ through automatic events, manual

inputs, and transitions from one state to the other. This information is provided via the

air situational display or radar display (Table 4-12).



Table 4-12 Summary of the five most affected equipment types from four datasets

ATC functionalities | Major failures
Communication | air ground communication; data exchange network
Surveillance | air situational display; primary and secondary surveillance radar (= loss of radar coverage)
Data processing and distribution | flight data processing system (FDPS)

4.4.6 Duration of equipment failures

This section provides the distribution of equipment failures according to their duration.

As discussed previously in section 4.1.4, three categories are distinguished, namely

short period of time (order of magnitude in minutes), moderate period of time (order of

magnitude in minutes up to one hour), and substantial period of time (order of

magnitude in hours or days). This categorisation is informed by the characteristics of

the failure duration extracted from the Country D dataset as it is the only dataset which

has this information available. In general, the data shows that equipment failures could

last for a significant amount of time, i.e. the average duration being more than ten

hours (M=10.25h, SD=77.6h). This variable is measured from the first log of the event

until its final closure, which may have occurred some days later. This is the reason for

the significant spread of the duration variable around its mean. Data analysis revealed

that more than 600 failures lasted more than 24h. One particular failure of radar

telephone lines was particularly extreme in its duration as it was logged initially on

November 20, 2003 and closed on June 09, 2004, lasting more than six months.

Figure 4-12 shows the distribution of the failure duration according to the four

categories. It can be seen that the majority of failures last for less than one day, while

34.5 percent of equipment failures last up to 15 minutes (corresponding to short

durations). This particular category of equipment failures (short period of time) is

relevant to controller recovery. Equipment failures lasting up to 15 minutes require ad-

hoc thinking, use of past experience, training, and existing recovery procedures to

select and implement an optimal recovery strategy for the relevant contextual

conditions. Moreover, short duration failures lend themselves to experimental study of

controller recovery, as presented in Chapter 9. Equipment failures lasting from 15

minutes to one hour belong to the moderate duration category. Available data shows that

approximately 26 percent of equipment failures belong to the ‘moderate period of time’



category. The final duration category, substantial period of time, is further divided into

two additional sub-categories, failures that last up to one day and those that last longer

than a day. This is done to extract more information as about 40 percent of the

equipment failures belong to the ‘substantial period of time’ category. The results of the

analysis suggest that eight percent of reported equipment failures in Country D lasted

more than one day. Further investigation of equipment types affected by failures lasting

more than one day revealed that the majority of these are data exchange network

problems, air situational display, flight data processing system, links with radar sites,

and air ground communication.

[Bar chart: frequency (0 to 3,000) of equipment failures per duration category (h): 0.00-0.25 (34.51%), 0.26-1 (25.85%), 1.01-24 (31.6%), >24.01 (8.04%)]

Figure 4-12 Distribution of the failure duration according to four distinct categories
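The duration categories used above can be expressed as a small classification function. This is a minimal sketch: the bin edges (0.25 h, 1 h, 24 h) are taken from Figure 4-12, while the function name, the category labels, and the treatment of values exactly on a bin edge are assumptions made for illustration.

```python
from datetime import date


def duration_category(hours: float) -> str:
    """Map a failure duration in hours to the four categories of Figure 4-12."""
    if hours < 0:
        raise ValueError("duration cannot be negative")
    if hours <= 0.25:  # up to 15 minutes
        return "short"
    if hours <= 1.0:  # from 15 minutes up to one hour
        return "moderate"
    if hours <= 24.0:  # more than one hour, up to one day
        return "substantial (up to one day)"
    return "substantial (more than one day)"


# The radar telephone line failure quoted in the text. Only the dates are
# given in the source, so clock times are ignored here (an assumption).
extreme_days = (date(2004, 6, 9) - date(2003, 11, 20)).days
extreme_category = duration_category(extreme_days * 24)

print(duration_category(0.2))
print(duration_category(0.75))
print(extreme_days, extreme_category)
```

Applied to the mean duration reported for Country D (10.25 h), the function returns the 'substantial (up to one day)' category, which illustrates how strongly the few very long failures pull the mean away from the most common category.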

Since this research addresses controller recovery from ATC equipment failures, the

focus is on ‘major’ failures within the ‘short period of time’ category. Table 4-13

presents the distribution of the major failures lasting up to 15 minutes, according to the

ATC equipment affected. It can be seen that the equipment most affected is the data

exchange network, followed by other surveillance systems (mostly radar

links), the flight data processing system, air situational display, and air ground

communication.

Table 4-13 Distribution of major failures lasting up to 15 minutes per ATC equipment affected


data exchange network | 28
other surveillance systems | 16
flight data processing system | 13.7
air situational display | 12


4.4.7 Additional statistical tests

After the summary statistics presented for each of the datasets available and for four

relevant variables (ATC functionality, complexity of failure type, severity, and duration),

the final step is to test any interactions that may exist between these variables. The

ATC functionality variable is used because it has only nine categories, compared to the

ATC equipment variable which has more than 60 different categories. The rationale

behind the choice of statistical tests performed is explained in section 4.4.1. The results

are presented in Table 4-14.

Table 4-14 Statistical tests and results obtained

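Country | Variable 1 | Variable 2 | Test | Statistical significance at 95 percent confidence level
Country A | ATC functionality | Severity | Non-parametric test (Cramer's V) | p<0.001
Country B | ATC functionality | Severity | Non-parametric test (Cramer's V) | p<0.001
Country C | ATC functionality | Severity | Non-parametric test (Cramer's V) | p<0.001
Country D | ATC functionality | Severity | as above | p<0.001
Country D | ATC functionality | Duration | as above | p<0.001
Country D | Severity | Duration | Non-parametric test (Kendall’s tau) | p=0.021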

All statistical tests revealed significant relationships. For all available datasets there is a

significant relationship between the type of ATC functionality affected and the

equipment failure severity rating. The main findings from these tests indicate the

dominance of equipment failures affecting the communication and surveillance

functionalities with both minimal and major impact (see Table 4-15). The last test,

namely the relationship between failure severity and duration for Country D’s dataset,

indicates a significant negative relationship. In other words, the data indicates that the

longer the failure, the less severe it tends to be. This finding is expected as more

severe failures tend to be attended to immediately and thus the time between the first

log and closure of these failures may be shorter.
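For readers unfamiliar with Cramer's V, the following sketch shows how the statistic is obtained from a contingency table of two categorical variables, such as ATC functionality against severity rating. The counts in `example_table` are hypothetical and are not taken from any of the four datasets. The function computes only the strength of association (0 for none, 1 for perfect); a significance level such as those in Table 4-14 would come from the underlying chi-square test, which is not reproduced here.

```python
from math import sqrt


def cramers_v(table):
    """Cramer's V for an r x c contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Normalise by the sample size and the smaller table dimension minus one.
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2 / (n * k))


# Hypothetical counts: rows are ATC functionalities, columns are the
# severity ratings (major, moderate, minimal). Illustration only.
example_table = [
    [30, 40, 130],  # communication
    [20, 35, 95],   # surveillance
    [5, 10, 60],    # navigation
]

print(round(cramers_v(example_table), 3))
```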



Table 4-15 Main findings regarding interactions between ATC functionality and severity

Country | Severity rating: Major | Severity rating: Minimal
Country A | surveillance | communication
Country B | communication | communication and navigation
Country C | navigation
Country D | communication and surveillance | communication and surveillance

After qualitative and quantitative assessment of the equipment failures in ATC, the next

section derives a framework of the equipment failure impact assessment tool. This tool

is designed to assess equipment failures and provide an indication of their severity or

overall impact on ATC operations.

4.5 Qualitative equipment failure impact assessment tool

The ATC functionality classification defined in Chapter 2 is used as a basis for the

framework of the qualitative equipment failure impact assessment tool, as designed in

this research. This tool takes into account the proposed classification as well as the

failure characteristics relevant to controller performance. Thus, all previously defined

equipment failure characteristics must be examined for their relevance to ATC

operations. Table 4-16 provides the list of equipment failure characteristics relevant to

this tool. These are the type of ATC functionality provided by the failing system,

complexity of failure type, time course of failure development, and duration of failure.

Table 4-16 Review of equipment failure characteristics with regard to their impact on ATC operations

Equipment failure characteristics | Impact on ATC operations | Comment
ATC functionality affected | √ | To be considered
Complexity of failure type | √ | To be considered
Time course of failure development | √ | To be considered
Duration of failure | √ | To be considered
Impact on operational room | x | Output
Impact on ATC operations (severity) | x | Output
Impact on ATM operations (capacity, delays) | x | Not relevant within the scope of this research

The inclusion of all failure characteristics in this tool except ‘ATC functionality affected’

is relatively straightforward. When including the characteristic ‘time course of failure

development’, out of three possible categories (i.e. sudden, gradual, and latent), the

category ‘latent’ was omitted. The reason for this lies in the fact that latent failures tend

to be overlooked in the overall ATC system for long periods of time until triggered by

some other failure. As such, they have a profound effect on the controller, but only

once they are triggered by another failure.

The ‘ATC functionality affected’ represents the key failure characteristic in terms of

effect on controller performance. It is significantly different if the controller is left to

operate without some key functionality (e.g. radar picture, communication, power

supply) as opposed to some auxiliary tools or equipment (e.g. monitoring tool, headset,

mouse). Therefore, it is necessary to separate ATC functionalities according to their

importance for the radar control of air traffic in a dedicated airspace. The separation is

intended to simply differentiate between primary and secondary ATC functionalities.

Their precise definitions informed by various examples are given in the following

paragraphs and Table 4-17.

Primary ATC functionalities are considered primary tools for achieving safe and

efficient flow of air traffic in any dedicated airspace. This group consists of the key

components, equipment, or tools of the communication, navigation, surveillance, data

processing, and power supply functionalities. These ATC functionalities are

categorised as primary ATC functions because they provide the critical information to

the controller. This critical information consists of: voice (and data) communication with

the aircraft in a dedicated airspace, aircraft horizontal and vertical position relative to

other traffic, and navigational directions or vectors to comply with the requirements of

the flight plan. These data are presented to the controller via an operational display

used for tracking the progress of multiple aircraft at any given moment. In modern ATC

Centres, the communication function is provided via the Voice Switching

Communication System (VSCS) touch panel (see Chapter 2 for more details). In

addition, it is necessary to highlight that the power functionality also represents a

primary function. This is a direct consequence of the computer driven ATC environment

where electrical power supplies all of the above mentioned systems. Therefore, in case

of any disruption (either from public utilities or an ATC Centre's own installation), the

controller may lose some or all primary functionalities. Table 4-17 captures the primary

ATC functionalities.

Secondary ATC functionalities (Table 4-17) represent supporting tools to achieve the

primary objective of the ATC service. Their function is important, but it can be replaced

by other, primary ATC functionalities. This group consists of: input/pointing devices,

system monitoring, safety nets, supporting ATC tools, as well as various components



of the communication, navigation, surveillance, and data processing functionalities. For

example, STCA, as a safety net, has gained popularity over the past few years because of

its increased safety application and as a last ground-based technical defence against

mid-air collisions. Its sole purpose is to alert the controller to unsafe projected proximity

of two or more aircraft. Therefore, this system cannot be considered a primary function

in ATC but more of a supportive one. Furthermore, ATC tools, such as arrival and

departure managers, help sequence takeoff and landing of aircraft to provide the most

efficient utilisation of available resources (i.e. runway and airspace capacity). Overall,

without these tools, the controller may still provide the same functionality with

potentially less efficiency and increased workload.

Table 4-17 Detailed overview of the primary and the secondary group of ATC functionalities

ATC functionality group | ATC functionality | Sub-functionalities (equipment, sub-systems, tools)
Primary | Communication | Air-ground; Ground-ground; Voice Switching Communication System
Primary | Navigation | Instrument Landing System (ILS) (during approach phase and in the case of reduced visibility)
Primary | Surveillance | Primary Surveillance Radar; Secondary Surveillance Radar; Parallel Approach Runway Monitor; Terminal Approach Radar; Precision Approach Radar; Air Situational Display
Primary | Data processing | Flight Data Processing System; Radar Data Processing System
Primary | Power supply | Main power system; Uninterruptible power supply (generator, battery)
Secondary | Communication | Data exchange network; Back-up system; Aeronautical Information Service; Other
Secondary | Navigation | Navigational aids (e.g. Very high frequency Omnidirectional Range - VOR, Distance Measuring Equipment - DME); Airport facilities control and monitor (navigation aids monitoring, aeronautical ground lighting)
Secondary | Surveillance | Surface Movement radar; Automatic Dependent Surveillance; Aerodrome Traffic Monitor; Other (radar link, radar console); Auxiliary Display
Secondary | Data processing | Flow control supporting equipment; Fallback facility; Other (e.g. strip printer)
Secondary | Supporting function (ATC tools) | Monitoring aids; Sequencing manager; Other
Secondary | Safety nets | Short Term Conflict Alert; Minimum Safe Altitude Warning; Area Proximity Warning; Runway Incursion Monitoring and Conflict Alert System
Secondary | Pointing and input devices | Pointing devices; Input devices
Secondary | System monitoring | Data recording and playback facility; Control and monitoring; Degraded modes; Time management
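The grouping in Table 4-17 lends itself to a simple lookup structure. The sketch below encodes the table as a nested dictionary and returns the group and functionality for a named item. The entries are taken from the table (the open-ended 'Other' entries are omitted, and the examples given for them, such as strip printer and radar link, are listed directly), while the names `CLASSIFICATION` and `classify` and the lower-case spelling of the keys are assumptions made for illustration. As noted in the text, an ATC Centre would adjust the entries to the taxonomy of its own systems.

```python
# Table 4-17 encoded as group -> functionality -> sub-functionalities.
CLASSIFICATION = {
    "primary": {
        "communication": ["air-ground", "ground-ground",
                          "voice switching communication system"],
        "navigation": ["instrument landing system"],
        "surveillance": ["primary surveillance radar",
                         "secondary surveillance radar",
                         "parallel approach runway monitor",
                         "terminal approach radar",
                         "precision approach radar",
                         "air situational display"],
        "data processing": ["flight data processing system",
                            "radar data processing system"],
        "power supply": ["main power system", "uninterruptible power supply"],
    },
    "secondary": {
        "communication": ["data exchange network", "back-up system",
                          "aeronautical information service"],
        "navigation": ["navigational aids",
                       "airport facilities control and monitor"],
        "surveillance": ["surface movement radar",
                         "automatic dependent surveillance",
                         "aerodrome traffic monitor", "radar link",
                         "radar console", "auxiliary display"],
        "data processing": ["flow control supporting equipment",
                            "fallback facility", "strip printer"],
        "supporting function (ATC tools)": ["monitoring aids",
                                            "sequencing manager"],
        "safety nets": ["short term conflict alert",
                        "minimum safe altitude warning",
                        "area proximity warning",
                        "runway incursion monitoring and conflict alert system"],
        "pointing and input devices": ["pointing devices", "input devices"],
        "system monitoring": ["data recording and playback facility",
                              "control and monitoring", "degraded modes",
                              "time management"],
    },
}


def classify(item: str) -> tuple:
    """Return (group, functionality) for a sub-functionality named in Table 4-17."""
    wanted = item.strip().lower()
    for group, functionalities in CLASSIFICATION.items():
        for functionality, items in functionalities.items():
            if wanted in items:
                return group, functionality
    raise KeyError(f"{item!r} is not listed in Table 4-17")


print(classify("Flight Data Processing System"))
print(classify("Short Term Conflict Alert"))
```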

Based on the selected characteristics of ATC equipment failures, it is possible to rate

the severity of each possible combination of characteristics. The three-level severity

rating defined previously, based on the impact of equipment failure on ATC operations,

has been used. This severity rating differentiates between major, moderate, and

minimal impact, as defined in section 4.2.3. In general, Figure 4-13 presents the

equipment failure impact assessment tool as a four-step methodology to assess the

severity of an equipment failure. After determining the exact characteristics of

equipment failure in each step, it is possible to follow the link to the final outcome, i.e.

severity rating.



Figure 4-13 Qualitative equipment failure impact assessment tool

The output of this tool is an assessment of the overall impact of an equipment failure

on ATC operations and consequently controller performance. The rationale behind the

severity ratings presented in Figure 4-13 is as follows:

• Loss of primary functionality tends to have moderate to major severity, depending

on other equipment failure characteristics (e.g. complexity of failure type) and

relevant contextual conditions (e.g. traffic). Moderate to major severity rating is due

to the fact that the primary ATC functionalities represent the critical tools for

achieving a safe and efficient flow of air traffic in any airspace.

• Loss of secondary functions tends to have minimal to moderate severity, depending

on the additional variables such as complexity of failure type, time course of failure

development, and duration. Minimal to moderate severity rating is due to the fact that

the secondary ATC functionalities only provide assistance for more efficient air

traffic control, but do not represent the systems without which the control of the air

traffic flow becomes unfeasible.

• Multiple failure occurrences may have a more severe impact on ATC operations

than a single failure occurrence simply because controllers have to cope with more

than one failure simultaneously.

• Gradual failures (e.g. gradual loss of data integrity) may have a more severe impact

on ATC operations than sudden failures (e.g. sudden loss of data).

• Duration of failure and severity rating tend to be inversely related. Data

analysis indicates that the longer the failure duration, the less severely it tends to

affect ATC operations and controller performance. The rationale is that more



severe failures tend to be attended to immediately and repaired in a shorter time.

Moreover, if it is known that a certain primary functionality will not be available for a

considerable amount of time an ATC Centre may impose strict flow restrictions. For

example, strict flow restrictions may be imposed in the event of total failure of the

surveillance function (loss of primary and secondary radar). Partial failure would

allow traffic but at a restrictive flow rate (loss of secondary radar). Even if a

prolonged failure affects secondary ATC functionality (e.g. strip printer), the

controller working position will have to be closed. This is due to the disruption

caused by replacement of a previously automated task with manual input of flight

information for each flight entering a dedicated airspace. As a result, it seems that

the most severe impact can be expected mainly from short to medium duration

failures.
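The rationale listed above can be illustrated with a short sketch. It is not the mapping shown in Figure 4-13, whose exact links are not reproduced in the text. It only encodes the stated tendencies: the severity range follows from the functionality group, and multiple failures, gradual onset, and short to medium duration push the rating towards the upper end of that range. The rule that two or more aggravating characteristics select the upper end, the one-hour threshold for 'short to medium' duration, and all names are assumptions made for illustration.

```python
# Severity range per ATC functionality group, as stated in the rationale:
# (lower end, upper end).
SEVERITY_RANGE = {
    "primary": ("moderate", "major"),
    "secondary": ("minimal", "moderate"),
}


def rate_severity(group: str, multiple: bool, gradual: bool,
                  duration_hours: float) -> str:
    """Illustrative severity rating from four failure characteristics.

    Assumption: at least two aggravating characteristics select the upper
    end of the range for the group; otherwise the lower end is returned.
    """
    lower, upper = SEVERITY_RANGE[group]
    aggravating = sum([
        multiple,               # more than one failure at the same time
        gradual,                # gradual rather than sudden onset
        duration_hours <= 1.0,  # short to medium duration (assumed threshold)
    ])
    return upper if aggravating >= 2 else lower


# A gradual failure of a primary functionality lasting ten minutes.
print(rate_severity("primary", multiple=False, gradual=True, duration_hours=10 / 60))
# A single, sudden, two-day failure of a secondary functionality.
print(rate_severity("secondary", multiple=False, gradual=False, duration_hours=48.0))
```

A rule of this kind is deliberately coarse. Contextual conditions such as traffic level, which the rationale also mentions, are not represented.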

The emphasis of this research is on equipment failures which may have a major impact

on ATC operations, including air traffic controller performance. Therefore, the output

of the qualitative equipment failure impact assessment tool in Figure 4-13 is useful for

selecting potential equipment failures of relevance to the research on controller

recovery (used to inform the experimental design in Chapter 9).

Considering this, the qualitative tool could be used in an operational environment in two

ways. Firstly, the left-to-right approach allows investigation of past equipment failure

occurrences and their impact on ATC operations. Secondly, using the right-to-left

approach this qualitative tool can be used as a method for design of the most severe

training scenarios. The training instructors could easily adjust the set of primary ATC

functionalities to the taxonomy of their systems/equipment and the characteristics of

the ATC system architecture. The qualitative equipment failure impact assessment tool

may be used as a design tool for the regular refresher unusual/emergency situation

training, as recommended by the EUROCONTROL ASSIST scheme (EUROCONTROL,

2003f).

The main disadvantage of the qualitative equipment failure impact assessment tool is

its inability to simultaneously assess the impact of several independent failures on

controller performance; rather it assesses one failure at a time as well as common

cause and common mode failures through the complexity of failure category. However,

previous research has already highlighted that multiple failure occurrences create

the highest workload (Wickens et al., 1997). As such, the current version of the

qualitative equipment failure impact assessment tool is sufficient for selection of the



most severe failure types, independent of each other. Future research should look into

the enhancement of this tool to enable the assessment of the impact of several

independent failures on controller performance. The output of this more advanced

approach would be to indicate the most severe independent multiple failure

combinations. However, to achieve this, the tool would have to be designed for a

specific ATC Centre to integrate the complexity of its ATC architecture and flow of data

between the various components of the ATC system.

4.6 Summary

In line with the objective of the research presented in this thesis, this Chapter has

identified potential equipment failure types and their key characteristics. Special

attention has been paid to the consequences of equipment failures and their impact on

ATC operations. A severity rating has been defined and applied to available operational

failure reports. The Chapter has further discussed technical recovery designed to

prevent or mitigate the impact of equipment failures on ATC operations and controller

performance.

Stepping away from theoretical findings from past literature, this Chapter has provided

operational input through the analyses of operational failure reports from four countries.

These analyses focused on four variables: the type of ATC functionality and equipment

affected by the failure, complexity of failure type, severity of its impact, and the overall

duration of the failure. Using the available reports it has been possible to identify

distributions of equipment failures in relation to these four variables. Although these

countries are different in terms of the volume and characteristics of airspace they

control, traffic levels, and equipment types, the analyses have shown that

communication and surveillance functionalities are affected most by equipment failures.

When observing only major failures, the most affected are the communication,

surveillance, data processing functionalities, and power supply. Further investigation of

major failures lasting a short period of time has revealed the most affected ATC

equipment. These are the data exchange network (as part of the communication

functionality), the flight data processing system (as part of the data processing

functionality), and air situational display (as part of the surveillance functionality).

The Chapter has concluded with development of a framework for the assessment of

the impact that every single equipment failure has on ATC operations. In general, the

knowledge acquired from equipment failure literature, informed by the analyses of

operational failure reports, has been incorporated into the qualitative equipment failure



impact assessment tool and its severity output. These will inform the choice of

equipment failure and its characteristics for the experiment designed to assess

controller recovery.

The safety-critical industry is aware of the fact that hazardous equipment failures

cannot be avoided and that absolute safety is not achievable. Thus, the same attention

given to their analysis should be given to the overall human recovery process. Kanse

(2004) points out that “what we really want to prevent is not so much the failures

themselves, but the negative consequences of these failures.” As a result, the following

Chapter gives appropriate attention to the controller recovery process.


5 Air Traffic Controller Recovery

The previous Chapter explained the characteristics of equipment failures and the

notion of technical recovery. This Chapter reviews the associated issues of the process

of controller recovery. In Air Traffic Control (ATC), the human recovery process

involves two groups of individuals. One group consists of controllers and the other

consists of system control and monitoring engineers¹. The Chapter starts with a brief

discussion of the roles controllers and engineers have in the recovery process. As the

focus of this thesis is on controller recovery from equipment failures, the Chapter

continues with a review of past research of relevance to this subject. In this respect, the

Chapter reviews in detail the phases of controller recovery and the corresponding

models developed for the Air Traffic Management (ATM) and non-ATM industries. This

is followed by a discussion of the major factors that influence the quality of controller

recovery. The Chapter concludes by proposing a set of variables used for a detailed

assessment of controller recovery performance later in this thesis. This set of recovery

variables is also used as a guide to the design of the experiment to capture real data

on controller recovery in Chapter 9.

5.1 Human recovery in air traffic control

The human recovery process in the ATC environment involves two distinct groups of

individuals. One group is represented by air traffic controllers and can consist of a

single controller or a team of controllers depending on the configuration of the ATC

Centre and the traffic levels at any given moment. Engineers from the system control

and monitoring unit belong to the second group. This section gives a brief description

of the role of each group and the specific tasks to be executed to recover from

equipment failures in ATC.

¹ Referred to as ‘engineers’ throughout the thesis.



5.1.1 Recovery by air traffic controllers

In the case of any equipment failure that affects controller performance (referred to as

a hazard in this thesis), controllers are responsible for recovering the system and

achieving a safe but not necessarily efficient level of operation. There are many human

factors issues that affect controller performance under normal conditions, and it is

reasonable to assume that the same factors are even more critical under abnormal

conditions, such as equipment failures. In other words, the context in which controller

performance takes place is important in understanding controller reliability. A detailed

review of contextual factors that may influence controller recovery and a methodology

for assessing their potential influence on controller performance are presented in Chapters 7 and 8,

respectively.

While a recovery procedure may or may not exist for a given equipment failure, most

ATC Centres have developed procedures for reporting and resolving such failures. Any

equipment failure should be reported to the supervisor, whilst those with operational

and safety impact must be reported under the mandatory occurrence reporting scheme

(for details see Chapter 3). Details of the failure are also forwarded to the system

control and monitoring unit.

When a failure has been rectified, the system control and monitoring unit notifies the

supervisor that the equipment has been restored to service. Then it is the duty of the

supervisor to inform the relevant sector staff and ensure that the restored equipment is

functioning correctly before updating the status of the failure in the database. In the

event that the system control and monitoring unit identifies a failure occurring in the

operations room, it is the duty of this unit to inform the supervisor who will subsequently

inform the controllers.

5.1.2 Recovery by system control and monitoring engineers

Failures are not necessarily detected only by controllers. Due to the layers of built-in

defences that exist in modern ATC Centres, the majority of equipment failures do not

affect the controller (NATS, 2002). These failures are detected by the technical system

and resolved by engineers from the system control and monitoring unit (e.g. by

receiving a system-generated alert and using redundant equipment, respectively).

EUROCONTROL (2004e) refers to an ATC system control and monitoring unit as ‘a

critical partner in maintaining ATC systems’. Engineers monitor and control equipment



that supports controllers. They reconfigure and maintain degraded or failed equipment

with minimum disruption to controller tasks and regularly upgrade the software as

operational requirements deem necessary. System control personnel have rapid and

reliable communication links with the ATC operations room via the supervisor. They

utilise this communication channel to inform ATC staff of the status and performance of

equipment and systems or to receive reports of technical problems and equipment

failures from the operations room. Therefore, EUROCONTROL (2004e) concludes that

recovering the ATC system from failure is a result of close coordination and

cooperation between controllers, technicians, and management.

Following this brief discussion of the roles and responsibilities of controllers and

engineers in the recovery process, the next section reviews the past research on the

human recovery process and its phases, developed for the Air Traffic Management

(ATM) and non-ATM industries. The main findings are then applied to a particular

process of controller recovery.

5.2 Phases of the controller recovery process

Existing literature on the human recovery process (either from human error or technical

failure) is largely based on the concept of a sequence of phases that constitute the

process of recovery. The human recovery process has become an important topic in

many areas of applied psychology, particularly in safety research in the chemical

industry (e.g. van der Schaaf, 1992; Kanse and van der Schaaf, 2000; and Kanse,

2004), the nuclear industry (Kaarstad and Ludvigsen, 2002), and the ATM industry

(Bove, 2002). Other examples include research on errors in the use of

human-computer interfaces (e.g. Kontogiannis, 1999; Rizzo, Ferrante, and Bagnara, 1995; Zapf

and Reason, 1994), in the office environment (e.g. Frese, Brodbeck, Zapf, and

Prümper, 1990), in software design (Frese, 1991), and in the assessment of everyday

slips and mistakes (e.g. Sellen, 1994).

As can be seen from Table 5-1, there is consensus amongst researchers in various

domains on the existence of at least three phases of the human recovery process. A

few researchers, focusing on errors in the design of human-computer

interfaces, include a phase before the actual detection: the occurrence of an error

(Zapf and Reason, 1994) or the emergence of a mismatch (Rizzo et al., 1995), with the

latter being a precursor of the detection phase. The emergence of a mismatch involves

the discrepancy between feedback and active knowledge (active expectations or

implicit assumptions). Rizzo et al. (1995) discuss and explain the difference between


mismatch and detection processes through several examples of human error.

Mismatch is considered as a breakdown of the action-perception loop. However, only

after actual detection of mismatch will it be understood as an error or a failure.

From the detection phase onwards, some phases, including diagnosis and correction,

are recognised by most researchers even though sometimes different terminology is

used (Table 5-1). For example, the diagnosis phase is often referred to as the

explanation, localisation, or identification phase. Similarly, the correction phase is often

referred to as the handling, planning and execution, recovery, or countermeasure

phase.

Table 5-1 Phases of the recovery process identified in past research

Author(s) | Context of research | Phases of the recovery process

Frese (1991) | Software design | Error detection; Error explanation; Error handling

Kontogiannis (1999) | Human Machine Interface | Error detection; Error explanation or localisation; Error correction

Zapf and Reason (1994) | Human Machine Interface | Error occurrence; Error diagnosis (detection + explanation); Error recovery (planning + execution)

Rizzo, Ferrante, and Bagnara (1995) | Human Machine Interface | Mismatch emergence; Detection; Recovery

Sellen (1994) | Assessment of everyday slips and mistakes | Error detection; Error identification; Error recovery

van der Schaaf (1992) | Chemical industry | Detection; Localisation; Correction

Kanse (2004) | Chemical industry | Detection; Explanation; Countermeasures

Kaarstad and Ludvigsen (2002) | Nuclear industry | Detection; Explanation; Correction

Bove (2002)2 | ATM industry | Detection; Correction

Therefore, in the research on recovery from equipment failures presented in this thesis,

past research is used to inform the phases of the controller recovery process.

2 Bove (2002) does not identify the diagnosis phase in the human error management process.

This may be due to the fact that this phase represents a covert human activity, difficult to observe, measure, and capture in incident reports.


Detection of equipment failure is taken as the first phase, triggered by the mismatch

between ATC system feedback and active knowledge of the controller (expectation or

assumption). This phase is followed by diagnosis and correction, leading toward

the outcome of the recovery process (as a result of both technical and controller

recovery).

Controller recovery is defined in this thesis as the ability of the controller to detect3,

diagnose, and correct any non-nominal system state resulting from ATC equipment

failure (adapted from van der Schaaf, 1995). The objective of the recovery process (i.e.

its outcome) is to restore the system to its nominal (pre-failure) state or at least to limit

the consequences of failure in the most efficient and effective way (by achieving a stable

non-nominal system state). The following sections discuss the phases of controller

recovery.

5.2.1 Detection

Human recovery is a sequential process whose first step is the detection of failure.

Without this detection there is no recovery process. Therefore, the first task of the

controller is to detect the failure. As previously explained, failures can be first

detected either by a technical system or by a controller. Hallbert and Meyer (1995) note

that to accomplish detection by the human operator, the stimulus must be

recognisable. In other words, the stimulus must be something that a controller has

already experienced, is trained to observe, or is of sufficient intensity to interrupt the

monitoring process (e.g. visual or auditory alert positioned within the field of view but

different from the background ‘noise’ already present on the radar screen or other

operational support system).

Thus, detection is triggered by any mismatch between the expected effects and

observed outcomes. The mismatch can be explained on the basis of the information

that is matched against the frame of reference or range of the expected system

responses. For example, after issuing an instruction for a flight level change to an

aircraft, the controller expects to see the old flight level gradually changing toward the

new one. However, if the controller observes a flight level change outside the expected

values, then this violated expectation will trigger the identification of some sort of ‘fault’. This

‘fault’ can be caused by an erroneous flight level change by the pilot or by an erroneous

system readout of the aircraft altitude (e.g. due to radar garbling).

3 Failures can be first detected either by a technical system or by a controller. Failures

detected by a technical system may trigger the generation of an alert (via a warning device) transmitting information on the failure to the controller. However, failures can also go unnoticed by the technical system and be detected by a controller working with fallible equipment.

In the case of a total failure of a particular function, it is easier to detect and diagnose

the significance of the change, since the failure is obvious. However, in the case of a

partial failure of a particular ATC function (e.g. corruption of tracks and squawks),

detection may be more challenging. In these circumstances, detection is based on the

controller’s memory of aircraft’s past positions and future trajectories, aided by

available tools (e.g. flight strips). An example of potential difficulties encountered by

controllers in detecting partial equipment failure is reported by Sampaio and Guerra

(2004). In this example, a sudden failure of the Radar Data Processing System (RDPS)

affected only one radar track and went unnoticed by the controller for 21 minutes (see

Chapter 4, section 4.2.1).

Detection is also closely connected to the time course of equipment failure

development, namely sudden, gradual, or latent failures (see Chapter 4, section 4.1.3).

Sudden failures do not allow any time to prepare, but are usually detected immediately.

On the other hand, detection of gradual failures may be extremely difficult and delayed.

Persistent (latent) failures are almost impossible to detect. They might exist in the ATC

system for a long period of time before they are detected. This is confirmed by

interviews conducted during this research with the aim of augmenting the theoretical

sources of information. Engineers from three European ATC Centres confirmed that

latent failures (mostly software failures) tend to go unnoticed until some other event or

failure reveals their existence (for evidence see Appendix II).

There are various other factors that can hinder failure detection, such as difficulties in

observing system feedback or remembering expectations about effects. Detection can

also be made difficult by inappropriate system design (e.g. poor human machine

interface, poor quality or position of alert), workplace layout, or controller working

strategy. As an example, an alert that is barely visible or audible may remain

undetected even by a highly alert controller.

Often, successful detection occurs as a consequence of a combination of design

qualities and mental resources. An example is taken from one of the European ATC

Centres where the label of the ATC function positioned in the ‘general information

window’ changes its colour from white to yellow in the case of a failure. However, in the


training facility of the same ATC Centre, within the same window, one specific label is

designed to be colour-coded yellow regardless of its status (i.e. label ‘Lines’ refers to

the status of the communication lines between a number of ATC Centres). Such a

training platform design feature has the potential to result in the missed detection of a

failure by a controller as a result of a continuous and consistent presence of the yellow

colour in the ‘general information window’.

Besides the quality of an alert, its onset also plays an important role. As previously

discussed in Chapter 4, alert onset (i.e. Time-To-Alert or TTA) is defined as the time

between a system’s detection of a failure and the moment an alert is presented on the

Human Machine Interface (HMI) either by colour change or text message. More

importantly, the future concept of cognitively convenient alarm onset aims to

circumvent human limitations by providing an alert, for a system-detected

failure, at the moment when the level of controller workload allows its detection

(see Chapter 4, section 4.3.2).

The above discussions have highlighted that detection can be either enhanced or

hindered by a combination of technical and human related factors. External stimulus,

past experience, appropriate design solutions, and sudden development of equipment

failures tend to enhance detection. However, inappropriate system design, high levels

of workload and fatigue may hinder failure detection. Similar conclusions are drawn

from the study on human recovery performance in nuclear power plants by Kaarstad

and Ludvigsen (2002). Based on a literature review, an experimental investigation, and

field studies, they identify the three most significant factors that affect the detection

phase. These are:

• communication - interaction with colleagues can provide information to detect a failure;

• system feedback - cues directly found in the operational environment (e.g. alerts, other unusual system events); and

• internal feedback - mismatch between the operator’s expectations of the system/environment and the existing system status.

All of the above-mentioned factors are relevant within the ATC environment. For example,

communication represents an important factor as the information on an equipment

failure can come from the supervisor or the system control and monitoring unit.

Similarly, in the ATC environment internal feedback is referred to as the ‘mental model’.

Once the controller is aware of an information mismatch, his or her task is to rapidly


determine the significance of that mismatch. Generally, the existing system output is

compared with the previously observed one, to determine whether the change is within

tolerance. For example, if an aircraft is in level flight, no flight level change should occur

and any deviation from the cleared flight level should trigger the detection of an

unusual event (e.g. pilot error, radar garbling).
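The tolerance comparison described above can be illustrated with a minimal Python sketch. The function name `level_mismatch` and the default tolerance of 2 flight levels are assumptions made purely for illustration; the thesis does not specify a tolerance value.

```python
def level_mismatch(cleared_fl: int, observed_fl: int, tolerance_fl: int = 2) -> bool:
    """Return True when the observed flight level deviates from the cleared
    flight level by more than the tolerance.

    All values are in flight-level units. The default tolerance is a
    hypothetical value, not one taken from the thesis.
    """
    return abs(observed_fl - cleared_fl) > tolerance_fl


# Aircraft cleared to maintain FL350 (level flight).
print(level_mismatch(350, 351))  # small deviation, within tolerance
print(level_mismatch(350, 356))  # deviation outside tolerance: mismatch to be assessed
```

A positive result only signals a mismatch; whether it stems from pilot error, radar garbling, or another cause is a matter for the diagnosis phase.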

The detection phase is investigated further using data from a questionnaire survey and

an experiment in Chapters 6 and 10 respectively.

5.2.2 Diagnosis

Once detection occurs, the diagnosis phase (also known as explanation, localisation,

or identification phase) determines what the failure is, its cause, and what should be

done to correct it. A controller needs a good knowledge of a failure to determine what is

occurring and its effects (e.g. what to expect in the near future, whether the function is

still partially available or totally lost, any problem with data integrity and possible impact

on other tools). This is especially important in the ATC environment where the overall

system consists of highly integrated components and different failures may present

themselves to the controller in a similar manner. For example, a radio frequency failure

manifests itself in the same manner regardless of its cause (i.e. ground- vs. airborne-

based failure). Therefore, it is up to the controller to identify the true failure by ruling out

alternatives. In this particular example, the controller will first try to establish radio

contact with other aircraft. If communication is established with the other aircraft it is

reasonable to assume that the failure is on the aircraft side. The controller will then try

to identify if it is a receiver or a transmitter failure by asking the aircraft to squawk

identification. If the aircraft squawks identification then the pilot clearly heard the

transmission. The controller then knows that the aircraft has experienced a transmitter

failure. By employing this procedure, the controller determines the precise element of

the equipment that failed, and thus implements the most appropriate recovery

procedure.
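The rule-out procedure in the radio frequency example can be sketched as a small decision function. This is an illustrative simplification: the names are hypothetical, and the two branches marked as assumptions (ground-side failure when no other aircraft responds, receiver failure when the aircraft does not squawk identification) are implied by the example rather than stated in it.

```python
from enum import Enum


class RadioDiagnosis(Enum):
    GROUND_SIDE_SUSPECTED = "ground-based failure suspected"
    AIRCRAFT_TRANSMITTER = "aircraft transmitter failure"
    AIRCRAFT_RECEIVER_SUSPECTED = "aircraft receiver failure suspected"


def diagnose_radio_failure(other_aircraft_respond: bool,
                           aircraft_squawks_ident: bool) -> RadioDiagnosis:
    """Rule out alternatives in the order described in the text."""
    if not other_aircraft_respond:
        # Assumption: no contact with any aircraft points to the ground side.
        return RadioDiagnosis.GROUND_SIDE_SUSPECTED
    if aircraft_squawks_ident:
        # The pilot heard the request, so only the transmitter has failed.
        return RadioDiagnosis.AIRCRAFT_TRANSMITTER
    # Assumption: no ident response suggests the aircraft cannot receive.
    return RadioDiagnosis.AIRCRAFT_RECEIVER_SUSPECTED


print(diagnose_radio_failure(other_aircraft_respond=True, aircraft_squawks_ident=True))
```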

Past research in non-ATM industries has shown that in some cases, after the detection

of a failure, the corrective actions are immediately known and implemented. In these

cases, the diagnosis phase is omitted (e.g. in the nuclear industry - Kaarstad and

Ludvigsen, 2002). Similarly, a study from the chemical process industry has shown

that the order of the phases is not always the same. More precisely, the diagnosis

phase does not necessarily follow the detection phase, especially in time-critical


operations. Often a quick fix might be necessary or an initial correction might occur

even before the cause of a failure has been identified (Kanse, 2004).

The findings from non-ATM industries are not entirely applicable to the ATC/ATM

environment. It is difficult to see how the diagnosis phase could be omitted simply

because proper ATC equipment failure recovery is not possible without knowing the

true nature of a failure. However, the duration and the attention dedicated to the

diagnosis phase relates directly to the level of workload experienced by the controller

at the moment of failure occurrence and during the recovery process. Through

interviews, a EUROCONTROL study determined that controllers on most occasions do

not seek an explanation for the cause of a failure (EUROCONTROL, 2004e). They focus

only on identifying the system that failed, which is essential to implement an adequate

recovery strategy. An example could be the code-callsign conversion failure, where,

having detected a problem, the controller has to identify the pair of aircraft affected.

This tends to be a very time-consuming process leaving no time for the controller to

consider the cause of the failure. Another example is corruption of radar data. If the

controller doubts the quality of a particular radar source in the multi-radar coverage

airspace, it is possible to use information from other radar sources. If the same failure

occurs in the single-radar coverage airspace, the controller has to disregard radar data,

initiate procedural (non-radar) control, and pass the problem to the system control and

monitoring unit. In both cases, the controller has to determine what failed and what the

impact of that failure is, in order to implement an adequate recovery strategy. The

cause of the failure is left to the system control and monitoring unit to investigate.
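The radar data corruption example can be summarised in a short sketch that maps the radar coverage of the airspace to the recovery actions named in the text. The function name and the use of a source count as input are assumptions for illustration only.

```python
from typing import List


def radar_corruption_response(radar_sources_available: int) -> List[str]:
    """Return the recovery actions described in the text for corrupted radar data.

    `radar_sources_available` counts the radar sources covering the airspace,
    including the doubted one (a hypothetical way of encoding coverage).
    """
    if radar_sources_available < 1:
        raise ValueError("at least one radar source is expected")
    if radar_sources_available > 1:
        # Multi-radar coverage: another source can replace the doubted one.
        return ["use information from other radar sources"]
    # Single-radar coverage: radar data can no longer be relied upon.
    return [
        "disregard radar data",
        "initiate procedural (non-radar) control",
        "pass the problem to the system control and monitoring unit",
    ]


print(radar_corruption_response(1))
```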

From the discussion above, it is clear that the diagnosis phase is important to identify

the equipment that has failed. However, once the failed equipment is identified and the corrective actions

are immediately known, further diagnosis of the cause is omitted in favour of the subsequent correction phase. The

diagnosis phase and the factors that may influence it are addressed further in Chapter

10, which presents the experimental investigation. Once the controller diagnoses the failure type and

its impact on the ATC system, the tasks shift to more action-based activities. In short,

the controller initiates the correction phase which is described below.

5.2.3 Correction

Failure recovery involves knowing how to undo or minimise the effect of failure and

achieve the desired system state (nominal or stable non-nominal system state,

respectively). The first priority is to minimise the effect on the air navigation service and

the exposure of the problem in terms of aircraft and time. Depending upon the


equipment failure type, recovery should follow available procedures (for details see

section 5.5). Some of them could be fairly simple like switching to another radar source

in multi-radar processing areas, changing to the secondary radio frequency (if the

primary one is blocked), changing unserviceable input devices (mouse or keyboard),

and switching to another console (if the current one is not operational). Other recovery

strategies could be very complex and both physically and mentally demanding. For

example, if an automated conflict detection tool fails to work properly (e.g. Short-Term

Conflict Alert, STCA, or Medium-Term Conflict Detection, MTCD), an alert might

appear when there is no conflict, or conversely the controller might detect a conflict that

was not alerted automatically. In both instances, the controller will diagnose that the

conflict detection tool itself is not functioning properly. Immediate action would be

required to ensure the safety of all traffic. In other words, the controller will have to

detect all existing conflicts and resolve them in a timely and efficient manner without

the assistance of automated safety nets (e.g. STCA). The second priority would be to

test and restore the automated function, which would be the responsibility of the

system control and monitoring unit.

Past research in the nuclear industry has identified different types of decision events

that constitute the correction phase of recovery (Orasanu and Fischer, 1997; Kaarstad

and Ludvigsen, 2002). These are assessed for the ATC environment below:

• ignoring the failure – error/failure has been detected, but ignored by the operator for two possible reasons: error/failure is considered irrelevant (i.e. no impact on operations) or the operator assumes that his/her intervention may make the situation worse. In any case the failure would have to be reported;

• applying procedures – this seems to be the most common correction type. Therefore, it is necessary to ensure that procedures exist and that they are appropriate to a particular failure;

• choosing a solution – in theory this is applicable when procedures are not available and the human operator has to apply more conscious resources to comprehend the situation. In many situations it may seem that only one solution is possible to resolve the failure. However, in retrospect, more than one solution may be available, while only one was considered at the time; and

• creating a solution – in this case the operator has no experience with the failure type. No procedures, training, or past experience are available for the human operator to draw upon. A completely new solution or strategy has to be created. This represents the most resource-demanding option of all. This process corresponds to human heuristic competence4 (Rigas and Elg, 1997).
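The four decision events listed above can be sketched as a simple selection rule. This is an illustration under stated assumptions rather than a model from the cited studies: the three boolean inputs are hypothetical simplifications, and only the first reason for ignoring a failure (no impact on operations) is represented.

```python
from enum import Enum


class CorrectionType(Enum):
    IGNORE = "ignore the failure (but report it)"
    APPLY_PROCEDURE = "apply an existing procedure"
    CHOOSE_SOLUTION = "choose among known solutions"
    CREATE_SOLUTION = "create a new solution"


def select_correction(affects_operations: bool,
                      procedure_available: bool,
                      has_relevant_experience: bool) -> CorrectionType:
    """Pick the least resource-demanding correction type that applies."""
    if not affects_operations:
        return CorrectionType.IGNORE
    if procedure_available:
        return CorrectionType.APPLY_PROCEDURE
    if has_relevant_experience:
        # No procedure, but training or experience offers candidate solutions.
        return CorrectionType.CHOOSE_SOLUTION
    # No procedure, training, or experience to draw upon.
    return CorrectionType.CREATE_SOLUTION


print(select_correction(affects_operations=True,
                        procedure_available=False,
                        has_relevant_experience=True))
```

The ordering of the checks reflects the increasing demand on the operator's resources, from ignoring a failure to creating a new solution.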

In the context of ATC, if the failure penetrates all existing built-in defences and affects

controller performance, it cannot be ignored. Thus, the recovery from ATC equipment

failures can be accomplished by applying a predefined procedure, modifying an

existing plan, or developing a new one. However, application of an existing procedure

would be the preferred option as it puts the least strain upon the controller. Compared

to the nuclear environment, the execution of the chosen procedure has to be done in a

very short time frame (EUROCONTROL, 2004e). An important aspect of the correction

phase and recovery is coping with stress induced by unexpected failure. Interviews

with controllers conducted for the EUROCONTROL study confirmed that unexpected

failures tend to significantly increase workload and stress (EUROCONTROL, 2004e).

Controllers are unable to perform their tasks effectively, and their ability to cope

with other adverse operational and environmental conditions is greatly reduced.

Furthermore, the controllers interviewed highlighted that critical incident stress

management is essential in managing the stress associated with equipment failures

(EUROCONTROL, 2004e).

The correction phase and the factors that may influence it are investigated further in

Chapters 6 and 10. From the discussions above, it is clear that existing recovery

procedures, recovery training, and past experience with equipment failures play an

important role in the overall recovery process. These three drivers build a knowledge

base for the choice or creation of the most appropriate solution for recovery from an

equipment failure. The discussion above, of the phases that constitute the process of

recovery, is followed in the next section by looking at the outcome of the recovery

process.

5.3 Outcome of the recovery process

Although the main recovery process consists of several phases, as explained

previously, these activities do not conclude the process itself (Figure 5-1). Prior to the

outcome phase, the human operator attempts to resolve the problem by implementing

a recovery strategy. This is followed in the outcome phase by post-correction

monitoring or post-recovery analysis to determine the actual outcome of the

implemented strategy. Therefore, the first task in this phase is the monitoring itself,

both by controllers and engineers. Proper design solutions could aid this phase by

providing post-recovery system status indicators.

4 There are two types of human competences: epistemic and heuristic. Epistemic competence

refers to domain knowledge about the system which one seeks to control. It is a context-dependent component of the actual competence. Heuristic competence refers to a general competence for handling complex dynamic tasks. It is context-independent, but it is developed over many years through both training and experience. As a result, actions and decisions become fast, automatic, without apparent conscious awareness.

[Figure 5-1 diagram; box labels: EQUIPMENT FAILURE; HAZARD; RECOVERY; OUTCOME; RECOVERY SUCCESSFUL; RECOVERY NOT SUCCESSFUL; RECOVERY CONTINUES; INCIDENT WITH FURTHER CONSEQUENCES]

Figure 5-1 Analysis of the outcome phase (adapted from EUROCONTROL, 2004e)

It might be expected that at this stage human performance requirements are similar to

those of the detection phase. However, as observed by EUROCONTROL (2004e)

there is a crucial difference. Guided by implemented corrections (recovery strategies),

monitoring by both engineers and controllers is driven more by ‘top-down’ processes,

primarily expectation. Since at this stage in the recovery process the operators have

knowledge of the failure and its cause, they also have expectations of how the system

might behave after a correction is implemented. For instance, if the system remains

unstable, operators may expect a reoccurrence of the same problem, other related

problems (common-mode or common-cause failures), or have a general suspicion that

the assessment of the problem was wrong or misleading.

Following the period of monitoring or active checks, the controller must decide whether

recovery is successful. Recovery is considered successful if the system returns to the

nominal (pre-failure) or intermediate, stable state (EUROCONTROL, 2004e).

The intermediate state represents a degraded operational state (e.g. loss of any function,

item of equipment, or a significant overload condition causing increased system

response time) which is detected and stabilised either by controllers or engineers. In

essence, the system is in the intermediate state if the consequences of failure are still

observable in the system performance while controllers are aware of the quality of


information they are receiving from the system and thus the quality of service they can

provide to traffic.

If recovery is unsuccessful, the controller will return to either diagnosis (to determine

the real cause of the problem) or correction phase to retry the previous strategy or

attempt a new one (Kanse and van der Schaaf, 2000; EUROCONTROL, 2004e). This

cycle of reapplied efforts continues as long as time is available for recovery.

Otherwise, if no time is available, the final outcome may be an incident with further

consequences (e.g. loss of separation).
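The outcome logic described in this section can be sketched as a loop: after each attempt the system state is monitored, recovery succeeds if a nominal or stable intermediate state is reached, and the cycle repeats while time remains. The sketch below is a simplification under stated assumptions: available time is abstracted as a number of attempts, and the `attempt` callable stands in for one pass through diagnosis or correction followed by monitoring. All names are hypothetical.

```python
from enum import Enum
from typing import Callable


class SystemState(Enum):
    NOMINAL = "nominal (pre-failure)"
    INTERMEDIATE_STABLE = "intermediate, stable"
    UNSTABLE = "unstable"


class Outcome(Enum):
    RECOVERY_SUCCESSFUL = "recovery successful"
    POSSIBLE_INCIDENT = "possible incident with further consequences"


def run_recovery(attempt: Callable[[int], SystemState],
                 attempts_allowed: int) -> Outcome:
    """Repeat recovery attempts until the system stabilises or time runs out.

    `attempt(n)` applies the n-th diagnosis/correction attempt and returns the
    system state observed during post-correction monitoring.
    """
    for n in range(attempts_allowed):
        observed = attempt(n)
        if observed in (SystemState.NOMINAL, SystemState.INTERMEDIATE_STABLE):
            return Outcome.RECOVERY_SUCCESSFUL
    # No time left: the final outcome may be an incident.
    return Outcome.POSSIBLE_INCIDENT


# First attempt leaves the system unstable; the second stabilises it.
observed_states = [SystemState.UNSTABLE, SystemState.INTERMEDIATE_STABLE]
print(run_recovery(lambda n: observed_states[n], attempts_allowed=3))
```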

The next section reviews the existing models of the failure and recovery process

developed to support the research on human recovery in ATM and non-ATM industries.

5.4 Models of human recovery

Throughout the reviewed literature, only a few models cover both equipment failure and

its recovery process. On the other hand, an extensive volume of research is dedicated

to models of recovery from human error. These models are the result of work in the

field of human reliability and can be transferred to recovery from equipment failure. In

chronological order, the review begins with the work of Frese et al. (1990) and Frese

(1991), which was based on office workers’ errors and error handling in using

computers. In 1992, as part of a PhD thesis on near miss reporting in the chemical

process industry, van der Schaaf (1992) developed the Eindhoven classification model

of system failures. This model was based on Rasmussen’s Skill-Rule-Knowledge

(SRK) model of human behaviour (Rasmussen, 1982), as human behaviour is one of the most dominant

factors causing system failures in chemical process plants. The SRK model of human

behaviour was extended to system failures, incorporating additional root causes of

incidents, namely technical and organisational factors. The incorporation of all relevant

failure factors has created a comprehensive approach to safety management.

However, the approach has suffered from the limitations of the SRK model as

discussed below.

Bainbridge (1984) reports problems using Rasmussen’s taxonomy of three main types

of cognitive behaviour, namely SRK. For example, the word ‘rule’ could be used for a

specific procedure, instructions, standard method based on previous experience, or

precise heuristic method. Another criticism is of the associated model for organisation

of cognitive behaviour, the so-called Rasmussen’s pyramid model. The model places

‘skilled’ behaviour at the base and ‘knowledge’ based behaviour at the top of the


pyramid. This model, although representing the general organisation of cognitive

behaviour, does not contain mechanisms for complex behaviour (see Bainbridge,

1984).

While the previous discussions focus mainly on models for recovering from human

error, this section further presents three models that focus on recovery from technical

failures. These are: the model by Kanse (2004) developed and tested in the chemical

process industry; the Recovery from Automation Failure Tool (RAFT), developed

specifically for the Air Traffic Management (ATM) industry within EUROCONTROL’s

project on Solutions for Human Automation Partnership in European ATM (SHAPE)

(EUROCONTROL, 2004e); and the model of failure recovery in air traffic control by

Wickens et al. (1998). The model by Kanse originates in a non-ATM industry but focuses

not only on the human as a system component, but on equipment and procedures as well.

This model lays down the ideas for the RAFT. The RAFT and the model by Wickens et al.

were chosen because of their relevance to research in this thesis as both assess the

impact of future automation on recovery from potential failures.

5.4.1 Model by Kanse

The basic principle behind the model by Kanse (2004) is a sequence of phases that

constitute the process of human recovery: detection, explanation (i.e. diagnosis), and

countermeasures (i.e. correction). The model is based on past research and

operational data from three studies of near misses in chemical process plants. Near

misses are incidents that have the potential to result in a loss (e.g. an

accident, injury, failure) but do not.

According to this qualitative phase model (Figure 5-2) the recovery process starts by

detection of a failure. This is followed by any combination of explanation (referred to as

diagnosis in this thesis) and countermeasures (referred to as correction in this thesis),

including omitting one or both of these phases but also their recurrences. For example,

the assessment of the order of the recovery steps performed by plant operators in each

incident revealed that the intermediate phase (i.e. diagnosis) was omitted in more than

35 percent of incidents (see Table 3 in Kanse, 2004).
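The ordering rule of this phase model (detection first, then any combination of explanation and countermeasures, including omissions and recurrences) can be expressed compactly as a regular expression. The single-letter encoding D, E, C follows the labels in Figure 5-2; the function name is hypothetical, and the sketch assumes that detection itself does not recur within a single recovery process, which the text does not address.

```python
import re

# D = detection, E = explanation (diagnosis), C = countermeasures (correction).
_PHASE_PATTERN = re.compile(r"D[EC]*")


def is_valid_recovery_sequence(phases: str) -> bool:
    """Check that a sequence starts with detection and is followed only by
    explanation and/or countermeasure phases, in any order and number."""
    return _PHASE_PATTERN.fullmatch(phases) is not None


for sequence in ("DEC", "DC", "D", "DCEC", "EC"):
    print(sequence, is_valid_recovery_sequence(sequence))
```

Under this encoding, a sequence such as "DC" corresponds to an incident in which the intermediate (diagnosis) phase was omitted.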

The model does not focus on the factors that influence the recovery process but

highlights that factors influencing recovery might be different in different domains.

Additionally, the model does not make any attempts toward the prediction of human

performance, future errors, or failures.


[Figure 5-2 diagram; box labels: BEGIN (problem situation arises as a result of one or more failures); D (detection of deviation); E (explanation of deviation and causes); C (countermeasures); END (of recovery process)]

Figure 5-2 Recovery process phase model (Kanse, 2004)

5.4.2 The RAFT Tool

EUROCONTROL’s SHAPE project addressed the effects of automation on human

performance and future ATM concepts. A part of this project focused on the technical

failures and the controller’s ability to manage them and resulted in the Recovery from

Automation Failure Tool (RAFT), as a method for analysing technical failures.

The basic principle behind RAFT is a sequence of phases that constitute the process of

failure and recovery (Figure 5-3). Taking into account a number of important factors that influence

the consequences of an equipment failure, the RAFT tool starts by assessing the

recovery context that has the potential to influence the human recovery process

(Figure 5-3). This is followed by an assessment of the failure cause, problem definition

(according to the RAFT framework an equipment failure leads to a functional

disturbance), and the failure effects. Then, the RAFT tool moves toward the

investigation of the human recovery process. This is done separately for the controllers

and engineers involved. The final step in the failure analysis is the outcome phase

and includes an assessment of the effectiveness of the implemented recovery strategy

(Figure 5-3).
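The sequence of analysis steps described above can be sketched as an ordered checklist. This is an illustrative data structure only, not part of the RAFT itself: the class, its methods, and the step labels are hypothetical, and the relative order of the controller and engineer steps is an assumption, since the text states only that they are analysed separately.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Step order as described in the text; labels are illustrative.
RAFT_STEPS = (
    "recovery context",
    "failure cause",
    "problem definition",
    "failure effects",
    "human recovery (controllers)",
    "human recovery (engineers)",
    "outcome",
)


@dataclass
class RaftAnalysis:
    """Collects analyst notes for one equipment failure, step by step."""
    failure: str
    findings: Dict[str, str] = field(default_factory=dict)

    def record(self, step: str, note: str) -> None:
        if step not in RAFT_STEPS:
            raise ValueError(f"unknown step: {step}")
        self.findings[step] = note

    def next_step(self) -> Optional[str]:
        """Return the first step without a recorded note, or None when done."""
        for step in RAFT_STEPS:
            if step not in self.findings:
                return step
        return None


analysis = RaftAnalysis(failure="conflict detection tool failure")
analysis.record("recovery context", "high traffic load at the time of failure")
print(analysis.next_step())
```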

The RAFT is based on past research and operational experience. Its foundation is a

qualitative model developed by Kanse and van der Schaaf (2000) for the chemical

process industry (further adapted by Kanse, 2004 as explained in the previous section).

The model by Kanse and van der Schaaf is further augmented with operational

experience, extracted from interviews with 31 ATM staff in four European ATC Centres.

The practical use of RAFT relies on expert group-based

evaluation of each failure and prediction of how controllers are likely to respond to

equipment failures. This tool is intended to be used together with other SHAPE project

outputs for predicting controller performance in the future highly automated



environment (e.g. a prediction of changes in controller skill requirements, workload,

trust). The approach has not been verified through recovery performance in either

simulated or operational environments, and still lacks a set of recovery-relevant

principles to guide designers of current and future ATM systems. Second-generation

prospective Human Reliability Assessment (HRA) methods could be used to develop a

predictive capability of the RAFT tool and to inform safety-adequate design principles

related to controller recovery from equipment failures.

Figure 5-3 The Recovery from Automation Failure Tool Framework (EUROCONTROL, 2004e)

5.4.3 Model by Wickens et al.

In 1998, the Panel on Human Factors in Air Traffic Control Automation established by

the Federal Aviation Administration (FAA) studied various aspects of human factors

and the role of the human in proposed future automated systems. Amongst several

different issues, research by this Panel recognised the importance of equipment

failures and recovery. The Panel proposes a model of ATC failure recovery and places

an emphasis on the consequences of degradation of automated ATC functionalities

(Wickens et al., 1998). It is assumed that the model is based entirely on available

research as the Panel focused on concepts that will characterise the future ATC

system. The basic principle behind this qualitative model is the impact of ATC

automation functionalities (left-hand side of Figure 5-4) on capacity, traffic density,

complexity, workload, situational awareness, manual skills, and recovery response

time. Each of these variables is associated with a sign (or a set of signs) indicating



whether automation is likely to increase or decrease the variable in question. However,

this model does not consider in detail how recovery is accomplished.

Figure 5-4 Model of failure recovery in air traffic control. Where two nodes are connected by an arrow, signs (+, -, 0) indicate the direction of effect on the variable depicted in the right node, caused by an increase in the variable depicted in the left node (Wickens et al., 1998)

The model also reflects the hypothetical function which relates recovery response time

to the level of automation (Figure 5-4). It is expected that recovery response time will

increase as the level of automation increases (shown as a dashed upward line on the

right side of Figure 5-4), due to increased complexity, skill degradation, and the overall

‘out of the loop’ phenomenon. The solid downward line reflects the decrease of the

reaction time available to controllers as a result of the introduction of higher levels of

automation. Controllers will have far less time to safely respond to any loss of

separation and fewer opportunities for effective solutions. As a result, this model

represents Bainbridge’s (1983) ‘ironies of automation’ by overlaying two critical time

variables against each other and as a function of automation-related changes. These

variables are: the time required to establish safe separation, given a degraded ATC

service, and the time available to a controller (or a team) to react and safely recover

from a failure.
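
The relationship between the two time variables can be illustrated with a small numerical sketch. The source describes the two curves only qualitatively (one rising and one falling with the level of automation), so the linear forms and all coefficients below are purely hypothetical assumptions chosen for illustration; they are not values from Wickens et al. (1998). The point of the sketch is the crossover: beyond some level of automation, the time required to recover exceeds the time available.

```python
def time_required(level: float) -> float:
    """Hypothetical recovery response time in seconds.

    Rises with automation level (0.0 = none, 1.0 = full); assumed linear.
    """
    return 20.0 + 60.0 * level


def time_available(level: float) -> float:
    """Hypothetical time available to react in seconds.

    Falls with automation level; assumed linear.
    """
    return 90.0 - 50.0 * level


def recovery_margin(level: float) -> float:
    """Time available minus time required; negative means recovery comes too late."""
    return time_available(level) - time_required(level)


def crossover_level() -> float:
    """Automation level at which the two assumed lines intersect.

    Solves 20 + 60x = 90 - 50x for x.
    """
    return (90.0 - 20.0) / (60.0 + 50.0)


if __name__ == "__main__":
    for level in (0.0, 0.5, 1.0):
        print(f"level={level:.1f} margin={recovery_margin(level):+.1f} s")
    print(f"crossover at level {crossover_level():.2f}")
```

Under these assumed coefficients the margin is positive at low automation levels and negative at high ones, which is the qualitative pattern the model predicts.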

Table 5-2 summarises the characteristics of the three models relevant to controller recovery

from equipment failure in ATC described above, and identifies their limitations, which are

addressed later in the thesis. In general, all three models are qualitative and based on

a principle of a sequence of phases that constitute the process of human recovery.



They are based on past research, whilst only one model is based on operational data.

The limitations identified in the last column of Table 5-2 guided the research presented

in this thesis and the main principles behind the framework for the assessment of

controller recovery. In short, the research in this thesis is verified in the simulated

environment (experimental investigation – Chapter 10), based on operational

experience (from interviews with relevant ATM staff, operational data – Chapter 4, and

the questionnaire survey - Chapter 6), and based upon detailed assessment of the

recovery context (Chapters 7 and 8).

Table 5-2 Summary of relevant models of the human recovery process

Model | Context | Operational input | Assessment of recovery | Prediction of recovery | Limitations

Kanse (2004) | Chemical industry | Yes (interviews and data) | Qualitative and quantitative | No | No assessments of the recovery context; no prediction of the recovery process

SHAPE’s RAFT tool | ATM | Yes (interviews) | Qualitative (expert-based) | Qualitative (expert-based) | Not verified in simulated/operational environment; based only on interviews and no operational reports

Wickens et al. (1998) | ATM | No | No | Qualitative and potentially quantitative (based on the recovery reaction time) | Theoretical approach

As stated previously, there are three major factors that influence the quality of

controller recovery, i.e. past experience, procedures, and training. Whilst procedures

and training are regulated within the aviation community, operational experience is

accumulated over time and controllers may or may not experience equipment failures

during their career. For this reason, the next sections describe and discuss existing

regulations regarding recovery procedures and training. Operational experience,

extracted from the questionnaire survey, is investigated in the following Chapter.

5.5 Procedures for handling ATC equipment failures

In both the literature and operational practice, procedures are recognised as the critical

factor for effective recovery. The following section provides an overview of the existing

international and national regulations on procedures for recovery from equipment

failures in ATC. This is followed by a discussion of the key principles of recovery

procedures in ATC, identified in this research.



5.5.1 Existing regulations

Regulation on procedures for handling ATC equipment failures, i.e. recovery

procedures, exists at three levels: international (i.e. the International Civil

Aviation Organisation - ICAO), regional or national (e.g. the European Organisation

for the Safety of Air Navigation – EUROCONTROL at the regional level and Civil Aviation

Authorities – CAAs at the national level), and air navigation service provider (ANSP)

level.

The main activity of ICAO is the establishment of International Standards,

Recommended Practices and Procedures covering all technical fields of aviation. The

‘Recommended Practices’ are desirable objectives to which ICAO member states

should aim (but are not required) to conform, whilst ‘Standards’ are considered

mandatory or required in the interest of safety of international air navigation (FAA,

2005). ICAO Standards and Recommended Practices are passed to the respective

regional organisation (e.g. EUROCONTROL) or directly to the national CAAs for

assessment and implementation. The national CAA is then responsible for assurance

and monitoring that these standards are properly implemented by ANSPs at the level of

ATC Centres. The current status of regulations on recovery procedures is discussed in

the following sections.

5.5.1.1 International regulation

Since 1945 ICAO has specified the standards, practices, and procedures for ATC. The

most recent edition of ICAO Annex 11, which covers air traffic services (ICAO, 2001c),

advises that “air traffic services authorities should develop and promulgate contingency

plans for implementation in the event of disruption or potential disruption of air traffic

services and related supporting services in the airspace for which they are responsible

for the provision of such services”. This ICAO recommendation represents a summary

of the key system safety principles that need to be considered within each air traffic

service unit. Moreover, several particular equipment failures are covered separately in

the ICAO document dealing with procedures for air navigation service (ICAO, 2001a).

These are radar equipment failure, ground radio failure (blocked frequency), ground

Automatic Dependent Surveillance (ADS) failure, and failure of Controller Pilot Data Link

Communication (CPDLC). Judged against the findings from the analysis of operational

failure reports presented in Chapter 4, ICAO has concentrated on the appropriate

components in terms of the communication and surveillance ATC functionalities, whilst

disregarding the data processing functionality.



In its guidance for recovery from these four failure types, ICAO recommends the necessary

steps to be taken by controllers and pilots, as well as ATC Centre watch managers or

supervisors. When necessary, ICAO also recommends collaboration with adjacent ATC

units. Therefore, the recovery process is not seen only as the responsibility of

controllers but all parties involved within the affected ATC Centre and region (including

the adjacent ATC unit which can provide valuable assistance in restricting or rerouting

the flow of traffic). All other failure types are left to national service providers to include

and define in their Manuals of Air Traffic Services (MATS).

5.5.1.2 European and national regulation

At the European level, EUROCONTROL published guidance and recommendations for

controller training in the handling of unusual/emergency situations, known as the

ASSIST scheme (EUROCONTROL, 2003f). This scheme covers all procedures for

aircraft emergencies but paradoxically does not cover any type of ATC equipment

failure. The ASSIST programme, captured in a publicly available document, is intended

to represent only a framework to be further customised and adapted to the specific

requirements of each ATC Centre utilising local expertise. Thus, each ATC Centre is

required to assemble a team of experts, implement the current ASSIST programme,

and discuss other safety-critical events (e.g. ATC equipment failures) to be included in

emergency procedures, training, and/or aide-memoire.

5.5.1.3 Air navigation service provider regulation

National air traffic service providers may publish their own procedures for

emergency/unusual situations in the MATS. The MATS contains procedures,

instructions, and information which form the basis of air traffic services within a country.

It is published for the guidance of civil air traffic controllers, but may also be of general

interest to other associated parties within civil aviation. For example, the UK MATS is

arranged in two parts. Part 1 is published by the UK CAA (as CAP 493; UK CAA, 2006)

and consists of instructions which apply to all UK ATC units. Part 2 is published by the

UK National Air Traffic Service Provider (NATS) and consists of instructions which

apply to a particular air traffic control unit (e.g. the London Area Control Centre).

NATS publishes specific recovery or fallback procedures in its internal MATS Part 2

document. This document defines 33 failure types and relevant strategies for their

recovery (NATS, 2002) and thus reflects the particular ATC system characteristics of

the UK ATC Centres. No information regarding the methodology to compile these



recovery strategies is available. It can only be assumed that these recovery procedures

are a direct result of expert discussions, operational experience, and experience with

ATC system performance.

The manual advises that the planning controller should be the focal point in the sector

team for the duration of the failure, with the main objective of ensuring that the

tactical/executive controller is supported at all times. The recovery procedure for each

of the 33 defined failures consists of the following:

• a short description of the failure (i.e. what a controller should expect, what are the

potential effects on the ATC system);

• a description of the system-generated alert (e.g. brown border, text message);

and

• a list of required recovery steps (these steps are separately defined for planner,

tactical/executive, assistant controllers, and watch supervisor).
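
The three-part structure listed above maps naturally onto a small record type. The sketch below is illustrative Python only: the class name, field names, role labels, and the placeholder entry are hypothetical and do not reproduce any entry of the MATS Part 2 document, which is not publicly available.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Roles for which recovery steps are defined separately, as listed in the text.
ROLES = ("planner", "tactical/executive", "assistant", "watch supervisor")


@dataclass
class RecoveryProcedure:
    """One recovery procedure entry (hypothetical structure for illustration)."""
    failure_description: str   # what to expect and potential effects on the ATC system
    system_alert: str          # e.g. a coloured border or a text message
    steps_by_role: Dict[str, List[str]] = field(default_factory=dict)

    def steps_for(self, role: str) -> List[str]:
        """Return the ordered recovery steps for one role (empty if none defined)."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        return self.steps_by_role.get(role, [])


if __name__ == "__main__":
    # Placeholder content only; not a real procedure.
    entry = RecoveryProcedure(
        failure_description="placeholder failure description",
        system_alert="text message",
        steps_by_role={"planner": ["placeholder step 1", "placeholder step 2"]},
    )
    print(entry.steps_for("planner"))
    print(entry.steps_for("assistant"))
```

Keying the steps by role mirrors the way the procedure separates the actions of each member of the sector team and the watch supervisor.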

The New Zealand air navigation service provider (i.e. Airways New Zealand) publishes

MATS as required by the Civil Aviation Authority of New Zealand. This document

recommends the use of the recovery procedures for failures of significant components

(e.g. radar data processing, flight data processing, the overall communication system),

as these have the most severe effect on ATC operations. The recovery procedures are

published as a separate document designed to be readily available at each position

(Failure Modes Quick Reference Guide-FMQRG; Airways New Zealand, 2006a). The

main objective of this document is to provide ready and quick assistance to operational

staff for handling equipment failures (i.e. aide-memoire).

The German air traffic service provider (DFS) defines emergency checklists for various

aircraft-related as well as military-specific emergencies. This document created a basis

for the development of EUROCONTROL’s guidance for controller training in the

handling of unusual/emergency situations and the ASSIST scheme (EUROCONTROL,

2003f). However, emergency checklists developed by DFS (like the

EUROCONTROL ASSIST scheme) do not cover any ATC equipment failures.

While ICAO provides generic recommendations for recovery, ANSPs tend to publish

recovery procedures in the form of a checklist of recovery steps that controllers need to

perform upon detection of any of the pre-defined unusual situations. This form is

practical and easy to follow, especially in the case of unexpected and emergency

situations, such as equipment failures. Similar to other types of emergency situations, it



is possible to define a set of equipment failure recovery steps whose implementation

leads to system protection and assurance of accurate situational awareness. The

selection of relevant recovery steps as well as the timely manner in which they are

implemented lead to effective or successful recovery. It is important to highlight that in

general all emergency/unusual situation procedures are intended as a general guide,

and controllers are expected to use their best judgment in any given situation.

As stated above, air navigation service providers that recognise the importance of the

existence of procedures for equipment failures publish them in their relevant manuals.

These unusual situations are slowly being included into a list of regular emergency

procedures. However, MATS manuals are not available in the public domain. For this

reason, it was necessary to set up a questionnaire survey to investigate the current

status and quality of procedures and training worldwide. The results of this survey are

presented in Chapter 6. The review of recovery procedures in ATC is concluded in the

following section by a discussion on identified areas of concern.

5.5.2 Main principles behind recovery procedures in ATC

Following the discussion of available recovery procedures in the aviation community,

this section summarises the key principles of recovery procedures in ATC. These

are availability, design, and contents, as presented below.

The EUROCONTROL report on managing technical disturbances (EUROCONTROL,

2004e) concludes that procedures represent a critical factor for effective recovery. If no

procedures are available to the controllers, they may use their own mental models of

the ATC system and operational environment to decide on the most effective recovery

strategy. Such ad hoc performance can significantly vary depending on the quality of

the controller diagnosis of the failure occurrence, experience, available information,

and the failure complexity. Therefore, to assure minimal required safety performance, it

is essential to provide recovery procedures to controllers.

Recovery procedure design should focus on phases of the recovery process and steps

that the controller must perform to recover effectively and ensure a safe ATC service.

Furthermore, the procedure should also contain the key effects of the failure on the

operational system, so that there is no potential that the controller may implement the

wrong procedure. Appendix III presents a framework for a check-list type of controller

recovery procedure or aide-memoire that should be available at each Controller

Working Position (CWP). This aide memoire, designed in this research, is based upon



the characteristics of the ATC Centre that participated in the experimental investigation

(presented in Chapters 9 and 10).

Finally, assuming that recovery procedures are available, their contents must be

accurate and kept up to date (i.e. reflecting all modifications/updates in the ATC system

architecture). They must be realistic, comprehensive, clear and easy to use, easily

accessible, and linked to regular emergency training.

Following this discussion of recovery procedures and their key principles in ATC, the

following section discusses training for handling ATC equipment failures in a similar

manner.

5.6 Training for handling ATC equipment failures

Like recovery procedures, training is also recognised as a critical enabler for

effective recovery. This section reviews the existing regulations on training for recovery

from equipment failures in Air Traffic Control (ATC) at three levels: international,

regional/national, and air navigation service provider. This is followed by a discussion

of several areas of concern regarding training for unusual/emergency situations in ATC, as

identified in this research.

5.6.1 Existing regulations

Regulation on training for handling ATC equipment failures, i.e. recovery training, exists

at three levels: international (i.e. ICAO), regional or national (e.g.

EUROCONTROL at the regional level and CAAs at the national level), and ANSP

level.

5.6.1.1 International regulation

ICAO guidance on human factors can be found in the Human Factors Training Manual

(ICAO document 9683; ICAO, 1998). According to ICAO, human factors principles

account for design, certification, training, operations, and maintenance, as well as safe

interfaces between humans and systems. The relevant module of the Human Factors Training

Manual highlights the necessity to train controllers on skills such as controller-

equipment relationship and operational aspects of automation (e.g. staying in the loop,

situational awareness, and the appropriate use of automated ATC equipment).

However, there is no specific guidance on training for emergency/unusual situations.



5.6.1.2 European and national regulation

A number of countries have realised the benefits of regular emergency training for

controllers and consequently have initiated training programs. In addition, on a

European scale, the EUROCONTROL European Manual of Personnel Licensing - Air

Traffic Controllers (EUROCONTROL, 2001d) now contains a requirement that ATC

units must include training for emergency/unusual situations in their training

procedures. This training should consist of two segments: the first is to prepare trainees, prior to

validation, in the procedures used in the event of an emergency situation and the

second is for routine refresher training to enable qualified controllers to respond to

unusual or emergency situations in a competent and professional manner. The

importance of practising unusual situations that have occurred elsewhere is recognised

and recommended as best practice. In general, the EUROCONTROL European

Manual of Personnel Licensing document details minimum standards for professional

qualification of controllers and has the aim of harmonising licensing schemes in

Europe. The following section describes how a particular incident made a significant

impact on the regulations related to emergency training within one Civil Aviation

Authority (CAA).

5.6.1.2.1 UK Civil Aviation Authority regulation

An emergency situation that occurred in UK airspace highlighted both the

importance of the existence of training in unusual situations and the necessity for

refresher training. In short, a particular aircraft reported dangerously low oil pressures

in both engines and consequently declared an emergency situation. In this incident the

controller on duty handled the situation with a ‘text book’ performance. The controller

informed the crew of the closest diversion airport, minimised radio frequency

transmissions while still passing all relevant information, and arranged direct routeing and

descent towards the chosen airport. During the course of the subsequent investigation,

the controller, a young trainee, pointed out that his actions were timely and efficient as

a direct result of the training in handling emergencies received the day before the

incident occurred (Baker and Weston, 2001).

As a result of the recommendations made in the report on this incident, in 1994 the UK

CAA’s Safety Regulatory Group (SRG) decided to mandate such training for all UK

controllers (Baker and Weston, 2001). In 1999, an initial set of guidelines was

broadened to include team related aspects and to place additional focus on unusual

events rather than just emergencies. This change was reflected in the TRaining for



Unusual Circumstances and Emergencies (TRUCE) scheme. TRUCE was designed to

ensure that staff involved in the provision of an air traffic control service are trained to

recognise and handle emergency occurrences and unusual circumstances in a

competent manner. Some of the emergency/unusual situations that severely affect the

ATC operations, including equipment failures, are mandatory in the TRUCE scheme

(UK CAA, 2003).

5.6.1.3 Air navigation service provider regulation

As noted above, aviation authorities recognise the importance of regular training for

equipment failures. According to the regulations issued by Civil Aviation Authorities

(CAAs), training for emergency/unusual situations is usually set up by air navigation

service providers within their respective ATC Centres. However, the type and

frequency of emergency training can deviate from existing regulations (due to shortage

of staff and infrastructure and the high costs involved). To further augment the

regulations on recovery training available from CAAs, it was necessary to set up a

questionnaire survey to investigate the current provision of recovery training worldwide.

The results of this survey are presented in Chapter 6.

5.6.2 Areas of concern related to recovery training

Currently in ATC there are several issues of concern related to training. Firstly, the

recovery training should follow the phases of the recovery process, where adequate

time and guidelines should be given for failure diagnosis. Secondly, established

controllers have been trained in non-radar or procedural control, which is not the case

with newly qualified controllers. This means that established controllers possess the

skills to handle any degree of radar failure (as one of the most severe equipment failure

types).

Thirdly, the frequency, comprehensiveness and range of unusual situations for training

in the simulated ATC environment vary from Centre to Centre. While some ATC

Centres offer comprehensive initial training, supported by annual refresher training,

other Centres offer little or no opportunity for staff to practise coping with unusual

occurrences in a simulated environment (EUROCONTROL, 2004e). This lack of

regular training and the infrequent occurrence of serious equipment failures may lead

to a serious lack of experience with recovery performance. In addition, as newer,

more automated ATC systems tend to be reliable, controllers are deprived of the

opportunity to experience equipment failure and recovery in the operational

environment, and therefore need to gain these experiences through regular training.



Fourthly, in spite of the clear need for regular training, the lack of resources

(infrastructure and staff) makes it impossible to train controllers for all different types of

emergency/unusual situations and all equipment failure types. For this reason an

organised exchange of experience at the level of ATC Centres, countries, or regions

(e.g. ECAC states, EUROCONTROL member states) may provide valuable knowledge

and insight into various unusual situations and strategies to resolve them. As an

example, in 2003 an A300 was struck on the left wing by a surface-to-air missile, resulting

in a complete loss of hydraulics and therefore loss of all flight controls. Reacting

rapidly, the captain recalled a television documentary he had seen about a DC-10

crash at Sioux City, Iowa, and the thrust change technique employed by the captain

and crew of the DC-10 to control their aircraft. Although the A300 crew had never

practised this technique before, they quickly gained control despite the extreme stress

of the situation (IFALPA, 2005). This example shows the importance of exchanging

information on knowledge, performance, and strategy between human operators.

Similar experience could be achieved in the area of ATM by supporting workshops,

newsletters, and other forms of information exchange on best practices and handling of

unusual events.

Finally, the EUROCONTROL (2004e) report on managing technical failures in ATC

points out potential future problems identified through controller interviews. Firstly, it

suggests that the mental picture of the traffic situation will be more difficult to form in

the future ATC environment. Secondly, it suggests that in the future, the controller may

require more knowledge of the ATC system architecture when compared to today.

Finally, the report suggests that newly qualified controllers and fully established

controllers have different perceptions of one another: newly qualified controllers are

perceived by some fully established controllers to be more trusting of the reliability of

new equipment, having rarely experienced failures in the past, while established

controllers are perceived by some newly qualified controllers as less computer literate

and more suspicious of technology.

The previous sections of this Chapter revealed the complexity of controller recovery by

discussing its relevant phases, from failure detection to the outcome of the recovery

process. In addition, past research has identified factors that influence the quality of

controller recovery. The next section defines a set of variables that capture the

important characteristics of controller recovery. These are the context that surrounds

the controller recovery process, the recovery effectiveness, as well as the recovery



duration. These variables guide the design of the experiment to capture real data on

controller recovery later on in the thesis.

5.7 Definition of controller recovery in this thesis

This thesis investigates the process of controller recovery from equipment failures in

Air Traffic Control (ATC). From discussions in the preceding sections of this Chapter, it

is clear that controller recovery (as human recovery in the particular context of the ATC

system) is a complex process that involves a number of steps that can be assessed

using different methods and variables. In summary, a credible assessment of controller

recovery should answer the following questions:

• What are the factors that influence controller recovery performance and choice of

recovery strategy (i.e. characteristics of the recovery context)?

• What is the effectiveness of the selected and implemented recovery strategy (i.e.

the required recovery steps and the outcome or effectiveness of recovery)?

• How efficiently does a controller respond to an equipment failure (i.e. the recovery

duration)?

These questions are discussed in the following sections.

5.7.1 Recovery context

Human reliability assessment research over the years has shown the important role of

the context in which human performance takes place. Recent techniques now place

more emphasis on the definition of key contextual factors and their impact on the

reliability of human performance. Context affects every part of the process of

recovering from equipment failures and thus includes past experience and the status of

recovery procedure and training relevant to a particular equipment failure under

investigation. As stated by EUROCONTROL (2004e), ‘context is everything’. Chapter 7

of this thesis presents a detailed review of the current understanding of contextual

factors in various ATM and non-ATM industries. The research presented in this thesis

uses these findings together with results from controller interviews to identify the

contextual factors relevant to controller recovery from equipment failures in ATC.

Furthermore, these factors are used in conjunction with an appropriate methodology to

further analyse controller performance during the process of recovery from failures and

to quantitatively define the recovery context indicator (Chapter 8). In addition, the

importance of the recovery context is further explored in the experiment (Chapters 9

and 10).



5.7.2 Recovery effectiveness

The recovery effectiveness of each controller responding to an unusual, emergency, or

non-nominal situation can be characterised by a set of required recovery steps.

Sections 5.5 and 5.6 reviewed the existing schemes for handling emergency

occurrences, achieved through defined recovery procedures and training. Existing

procedures and schemes were reviewed including the UK CAA’s TRUCE scheme, UK

NATS fallback procedures, Airways New Zealand, and the German air traffic service provider

(DFS) emergency checklists, all designed to ensure that staff involved in the provision

of ATC service are trained to recognise and resolve any emergency situation in a

competent manner. In addition, the review included an overview of

EUROCONTROL and ICAO guidance for recovery procedures and recovery

training (EUROCONTROL, 2003f; ICAO, 2001a).

In general, these safety schemes create a checklist of recovery steps that follow the

phases of the recovery process (i.e. detection, diagnosis, correction). These checklists

are written procedures and controllers are expected to know and follow them. In a

similar way, an ATC equipment failure is considered as one type of unusual/emergency

situation. Although equipment failure related procedures or checklists are not always

available, it is possible to define a set of required recovery steps, whose

implementation can assist in the protection of the system and preservation of accurate

situational awareness. The selection of relevant recovery steps and the time frame in

which they are implemented contribute to an effective or successful outcome of the

recovery process. This is explained further in the following section.

5.7.3 Recovery duration

The duration of the controllers’ recovery process is the time measured from the first overt

controller action to the end of the recovery process. The end of the recovery process is

influenced by the restoration of the failed component or by the reversion to the backup

facilities (i.e. fallback systems). The analysis of operational failure reports (Chapter 4)

indicates that the longer the failure, the less severe it tends to be. As a result, the

research presented in this thesis focuses on failures of short duration. Furthermore,

past research has focused on the reaction time, while putting more emphasis on its

extreme values (see Wickens, 2001). However, extracting the controller reaction time

can be an extremely difficult task as this first reaction usually represents covert (i.e. not

directly observable) behaviour. For this reason, the research presented in this thesis



focuses on the controllers’ first action that is observed on the ATC system (e.g.

communication regarding identified failure, interaction with HMI).
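The definition of recovery duration given above can be illustrated with a minimal sketch. The event labels, the `Event` record, and the example log are hypothetical and are introduced here only for illustration; they are not taken from the experimental set-up of this thesis.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time_s: float  # seconds since the start of the exercise
    kind: str      # hypothetical event label


# Hypothetical labels: overt controller actions observable on the ATC system,
# and events that mark the end of the recovery process.
OVERT_ACTIONS = {"communication", "hmi_interaction"}
END_EVENTS = {"component_restored", "fallback_reversion"}


def recovery_duration(events):
    """Time from the first overt controller action to the end of recovery.

    Returns None when either boundary cannot be observed in the log.
    """
    ordered = sorted(events, key=lambda e: e.time_s)
    start = next((e.time_s for e in ordered if e.kind in OVERT_ACTIONS), None)
    if start is None:
        return None
    end = next(
        (e.time_s for e in ordered if e.kind in END_EVENTS and e.time_s >= start),
        None,
    )
    if end is None:
        return None
    return end - start


# Illustrative log: the failure itself is not an overt controller action.
log = [
    Event(0.0, "failure_injected"),
    Event(12.5, "communication"),
    Event(20.0, "hmi_interaction"),
    Event(95.0, "fallback_reversion"),
]
duration = recovery_duration(log)
```

In this sketch the covert detection interval (from 0.0 s to 12.5 s) is deliberately excluded, which mirrors the limitation of the recovery duration variable discussed in this section.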

Apart from the moment of actual detection, the recovery duration variable may also

lack some aspects of the diagnosis phase. In other words, the cognitive processes

behind understanding the new situation and prioritisation of the recovery tasks to be

performed may also occur covertly. For example, the real cause of the communication

failure is not immediately obvious as the controller needs to investigate if the failure

affects ground ATC equipment or airborne radio equipment. Both of these features of

controller recovery are considered in the design of the experimental investigation

presented in Chapter 9.

5.8 Summary

As pointed out at the beginning of this Chapter, a good understanding of recovery

requires a detailed assessment of the recovery process from both the technical and

human perspectives. Whilst the previous Chapter discussed the technical recovery, this

Chapter focuses on controller recovery. The Chapter starts by distinguishing the

objectives of two separate groups of operators involved in recovery from equipment

failures, namely controllers and engineers. While this thesis focuses solely on controller

recovery from equipment failures, the reviewed theoretical background to human

recovery is applied to the controller recovery by identifying its major phases. As a

result, the main phases of controller recovery together with the outcome of the overall

recovery process have been described. Finally, various models of human recovery,

developed for both ATM and non-ATM industries, have been discussed with emphasis

on three of the most relevant ones to controller recovery. These are: the model by

Kanse derived for recovery performance in the chemical process industry, the RAFT

tool derived specifically for the ATC operational environment, and the model by

Wickens generally focusing on the impact of different levels of automation on the

recovery process.

Apart from identifying the main phases of the controller recovery process, the review of

the theoretical background has also highlighted the factors that influence the quality of

controller recovery, namely past experience, recovery procedures and training. While

past experience is aggregated throughout the controller’s operational experience, the

current status and quality of recovery procedures and training are regulated by

international and national aviation authorities. Thus, the Chapter reviews and discusses

the current status of regulation regarding recovery procedures and training, whilst the



feedback regarding controllers’ past experience is gained from the

questionnaire survey presented in the following Chapter. After reviewing theoretical

findings extracted from ATM and non-ATM research relevant to controller recovery, the

Chapter concludes by proposing a set of variables for an in-depth assessment of

controller recovery. This is achieved by assessing the context, quality, and temporal

characteristics of the controller recovery process. These variables also guide the

experimental design to collect real data on controller recovery (Chapter 9).


6 Questionnaire Survey

Chapter 5 showed that limited research has been carried out globally on human

reliability in relation to controller recovery. Hence, this Chapter presents the details of a

questionnaire survey scheme with the aim of overcoming this lack of knowledge and

further supporting the research in this thesis. The specific objectives of the questionnaire

survey are to investigate controller experience with equipment failures and to identify

factors that affect their recovery, to extract more operational experience, to investigate

the status and quality of recovery procedures and training, and to contribute to the

wider human reliability research by assessing the specific controller recovery. The

Chapter starts with the definition of the target population and sampling. It proceeds by

discussing the survey methodology identified for the collection of questionnaire

responses, design of the questionnaire, and the refinements identified by a pilot survey.

This is followed by the description of the full survey scheme (Figure 6-1). The Chapter

concludes with the methodology for the questionnaire survey data analyses structured

in three segments. These are: assessment of the sample characteristics, high-level

frequency analyses, and an in-depth assessment of interactions between recovery factors.



Figure 6-1 The flow diagram of organising a survey

6.1 Objectives of the questionnaire survey

One of the objectives of the research presented in this thesis is to address the general

lack of knowledge in the area of controller recovery from equipment failure. This is vital

in order to enhance safety and operational efficiency in the current and future ATC

environment. As described in Chapter 5, although significant human reliability research

has been undertaken in other industries, such as nuclear and chemical processing, it is

not directly transferable to the highly dynamic ATC environment. In order to address

the issues above, the questionnaire survey presented in this Chapter focuses on four

objectives. Firstly, the survey is designed to investigate controller experience with

equipment failures and to identify factors that affect controller recovery. This is to be

achieved by extracting the operational experience from the sample of air traffic

controllers. Secondly, the survey is to be used to augment the information obtained

from the operational failure reports (as presented in Chapter 4) which lack any input on

controller recovery. This is achieved by questioning the participating controllers as to



the most severe failures they have experienced. Thirdly, the survey contributes to the

determination of the status and quality of recovery procedures and training in ATC

Centres (and thus augments the findings from Chapter 5). Finally, the survey is

designed to contribute to the wider human reliability research by assessing the specific

controller recovery performance.

Six key questions were formulated in order to achieve the four objectives. The

questions (below) address ATC equipment, controller recovery performance, and

status of recovery procedures and training:

• How often do controllers experience equipment failures (Q1)?

• What factors influence their recovery performance (Q2)?

• What is the most unreliable ATC equipment (Q3)?

• Is there any organised exchange of information on equipment failures and/or other types of unusual/emergency situations (Q4)?

• Do recovery procedures exist (Q5)?

• What do controllers feel about the quality of training currently available for recovery from equipment failures (Q6)?

Given the objectives of the questionnaire survey above, the next section defines the

target population and sample size.

6.2 Sampling

The population for this questionnaire survey should consist of controllers from various

ATC Centres worldwide. The population characteristics to be sampled in this survey

are ATC Centres with different levels of traffic and airspace complexity, and ATC

system automation, and controllers with a range of operational experience (i.e. years in

service, rating).

Using the United Nations (UN) statistics that there are 191 independent countries

worldwide (United Nations, 2006), it is possible to estimate the total number of ATC

Centres. However, data on the number of ATC Centres for each country were not

available to this research (personal correspondence with the International Federation of

Air Traffic Controllers' Associations (IFATCA) revealed that this data is not available).

Therefore another approach based on the distribution of global air traffic (Airbus, 2004)

has been used. In other words, the ideal sample should consist of regional distributions

of sampled controllers that correspond to the air traffic distribution as presented in

Figure 6-2. Moreover, it is also important to obtain a sample

that represents the current distribution of air traffic but also accounts for its future

predicted growth. The predicted growth in air traffic to the year 2023 indicates the

importance of Asia/Pacific and Middle East regions, while other markets remain steady

(Figure 6-2). Airbus (2004) predicts that Asian airlines will experience the fastest

growth rates. This prediction is in line with observed changes in the aviation market

and the shift towards Asian operations (Airbus, 2004; Air Transport Action Group,

2005). Moreover, it is predicted that by 2023 the already mature North American

domestic market will lose its historical dominance to both Europe and the dynamic

Asia/Pacific region. Based on all these findings, the target of the questionnaire survey

should be to collect responses from Asia/Pacific, Europe, and North America

corresponding to characteristics of the population surveyed (i.e. different levels of traffic

and airspace complexity, ATC system automation, and controller experience).

Figure 6-2 Distribution of world air traffic per region for the years 2003 and 2023 (adapted from Airbus, 2004)

[Bar chart. Horizontal axis: Region (Africa, Latin America and Caribbean, Asia and Pacific, Europe, North America, Middle East). Vertical axis: Percentage. Series: 2003 and 2023.]

Having defined a target population and its characteristics to be sampled, it is important

to define the size of the sample. Collecting a large sample of data would pose a

significant challenge as it would be a logistically huge task and very time consuming for

one single researcher. Therefore, the sample size needed to be contained within

manageable proportions. However, the sample still needed to be representative of the

population of controllers. As guidance, the modelling of controller operational

experience with the normal distribution requires approximately 20 data points (Shier,

2004). Increasing this minimal sample size by a factor of 5, the target sample size was

initially set at 100 controller responses. This sample size is in line with the sample

used to support a Federal Aviation Administration (FAA) study of similar scope (i.e. 128

responses from aviation experts; Funk, Lyall, and Riley, 1996). However, target sample



size (in terms of number of controllers and ATC Centres sampled) would vary

according to the choice of data collection method and available resources.

6.3 Survey methodology

Surveys have long been recognised as a valid method for measuring attitudes (or

preferences), beliefs, or facts (including past behavioural experiences). Actually, one of

the most common uses of surveys is to measure individuals’ past behavioural

experiences (Weisberg, Krosnick, and Bowen, 1996). The aim of the questionnaire

survey presented in this thesis is to collect facts regarding equipment failures and

controller recovery, in particular the operational experience and status of procedures

and training for equipment failures. Therefore, using a survey to collect these types of

data is justified.

Due to the nature of this survey, the methods available were either to gather the

information directly from face-to-face interviews with controllers in various ATC Centres

or remotely by self-completion via the internet and professional networks. Although less

reliable, the use of the internet and professional networks is useful in presenting a

wider picture of controller experience and recovery from equipment failures. The

advantages and disadvantages of both methods are presented below.

Data gathering through face-to-face interviews requires visits to ATC Centres and

direct access to controllers. This approach is comparatively more reliable since it

presents the opportunity to clarify any issues either prior to or during the interview.

Moreover, it facilitates representative sampling for example within an ATC Centre as

more than one controller can be asked to participate. The drawbacks of this approach

are the practical and financial issues related to the cost of travel and access to enough

ATC Centres to generate a representative sample depending on the characteristics of

the population.

In a self-completion survey, the questionnaires are distributed using a professional

network or popular aviation related internet forums. Compared to face-to-face

interviews, this method saves time and enables more questionnaires to be distributed.

However, research has shown that the response rate is inferior to face-to-face

interviews. A response rate of 10 to 50 percent is usually achieved with self-completion

questionnaires compared to 100 percent in the case of face-to-face interviews. This

means that in order to collect 100 samples, between 200 and 1000 questionnaires

should be distributed. The questionnaires may be distributed via personal/professional



network and corresponding emails. However, accessing the email addresses of 200-

1000 controllers worldwide presents a significant obstacle to the distribution of

questionnaires.
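The distribution figures quoted above follow directly from the target sample size and the expected response rate. The short sketch below reproduces the arithmetic; the function name and its integer-percentage argument are illustrative choices and do not describe any tool used in this research.

```python
def questionnaires_needed(target_responses: int, response_rate_percent: int) -> int:
    """Smallest number of questionnaires to distribute so that the expected
    number of returns reaches the target, for a given response rate."""
    if not 0 < response_rate_percent <= 100:
        raise ValueError("response rate must be in the range 1 to 100 percent")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-target_responses * 100 // response_rate_percent)


best_case = questionnaires_needed(100, 50)   # 50 percent response rate
worst_case = questionnaires_needed(100, 10)  # 10 percent response rate
face_to_face = questionnaires_needed(100, 100)
```

With a target of 100 responses, the best and worst cases correspond to the 200 and 1000 questionnaires mentioned in the text, whereas a 100 percent response rate would require only 100.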

Additional problems with the self-completion method are the number of responses and

the quality of survey sample obtained. The self-completion method depends entirely on

the intention and willingness of the controller to participate in the survey. Thus it is

harder to control the number of responses obtained. Apart from the high likelihood of

low response rate of a self-completion survey, another drawback is that the quality of

the answers cannot be controlled. Even in the case of straightforward questions,

respondents may misinterpret some of the questions or may need more information on

the subject under investigation. The presence of the researcher, while the respondent

is answering the questions, provides the advantage of ensuring that the respondent

understands what is required from the survey.

After careful consideration of both the advantages and disadvantages of the two survey

methods (face-to-face and self-completion), both were adopted in this thesis. This

decision was based on the need to exploit the strong points of both methods

particularly given the timing and response rate constraints. In order to maximise the

benefit of the combined approach, the design of the questionnaire must account for

their unique characteristics.

6.4 Design of the questionnaire

It is very important when designing a questionnaire to focus on information needed for

the study and to present questions in an unbiased fashion to enable responses with a

high degree of fidelity. The length of the questionnaire should also be considered.

Given the decision to use both face-to-face interviews and self-completion surveys, it

was necessary to focus on a questionnaire design that meets the requirements for both

methods. While face-to-face interviews allow a more complicated structure for the

questionnaire, additional attention has to be paid to the length of the interview. The

self-completion survey allows detailed questions to be designed using a less complex

structure. This survey method requires a written introduction to explain the objectives of

the study, its added value, and the key features of the survey itself (e.g. format, type of

questions, approximate time required for the survey completion).

One of the possible solutions was to design two sets of questionnaires; one for the

face-to-face interview and the other for the self-completion survey. However, to assure



the highest reliability and completeness of responses, it was decided to use one

questionnaire design in both survey methods. The aim was to design the questionnaire

survey to extract the maximum information whilst ensuring convenience for both face-

to-face and self-completion respondents. This was achieved following several design

principles. Firstly, special attention was given to clarity of questions to avoid any

ambiguity in the self-completion survey. Secondly, emphasis was placed on closed

questions, where the respondents’ answers did not require the presence of the

researcher. Closed-ended questions can be answered finitely by one of the given

answers; the simplest form being the yes/no answer. In general, these questions are

restrictive and can be answered in a few words. Thirdly, all key terms were defined.

Finally, for open questions, a list of potential answers was provided to guide the

respondents (e.g. for the question on the most unreliable ATC equipment, a

comprehensive list of various ATC equipment was provided). Open-ended questions

allow respondents to answer in their own words providing a narrative. In general these

questions solicit additional information, as they require more than one or two word

responses. Furthermore, the questionnaire was designed in a way that ensured that

any inconsistencies in responses can be identified. This was achieved through the

careful choice of questions and by having multiple questions assessing a particular

issue (e.g. recovery procedures).

The questionnaire has been structured around the main objective of the research

presented in this thesis. In other words, all the questions have been designed to

support the research on controller recovery from equipment failures in Air Traffic

Control (ATC). Based on the type of information obtained, the questionnaire is

structured in four distinct groups totalling 29 questions. The first group consists of

general and specific questions. The former covers the overall operational experience,

ratings, and the country/ATC Centre where the respondent works. The latter inquires

specifically about experience with equipment failure, asking the respondent to list

several examples in greater detail. This first group consists of five questions.

The second group of questions inquires about the factors that affect controller recovery

by asking the respondent to rate the importance of three factors. This is followed by the

question on the most unreliable ATC systems/components, as well as the

organisational issues relevant for recovery. In total, this second group consists of four

questions.



The third group of questions focuses on the existence and quality of recovery

procedures at the ATC Centre where the respondent works. This group consists of 11

questions.

The fourth group of questions focuses on the existence and quality of training for

recovery at the ATC Centre where the respondent works. This group has nine

questions. The final question provides an opportunity to the respondent to add

comments and suggestions related to the entire questionnaire.

The following is a one-page example of the questionnaire which was used during the

survey (Figure 6-3). It is the second page of the questionnaire. A complete

questionnaire is included in Appendix IV, while an example of a response to the

questionnaire is provided in Appendix V.

Figure 6-3 One-page example of the questionnaire

6.5 Pilot survey

Before conducting the full survey, a small-scale pilot survey was performed to verify the

clarity of questions and the time necessary to complete the questionnaire. It surveyed

two EUROCONTROL in-house controllers, two ATM specialists, and three

psychologists with backgrounds in ATC and the design of questionnaire surveys. No

conflicting issues were identified between them. Their input included only minor

amendments in the design of the questionnaire, such as additional emphasis on the



added value of the survey and how the results would be used. This information was

included in the introductory page of the questionnaire (i.e. the first page). Additionally,

the pilot survey revealed the need for some examples of ATC equipment/tools which

were added as a note after question 5. These changes were incorporated in the final

design of the questionnaire.

The following sections discuss how the survey methodology has been exploited to

achieve the target sample size.

6.6 Full survey

As discussed previously, responses were gathered using face-to-face interviews

and self-completion methods. The results are briefly presented below.

6.6.1 Face-to-face interviews

Professional visits to various ATC Centres and relevant organisations were used to

distribute questionnaires to available controllers and capture their responses through

face-to-face interviews. Using this approach, responses were received firstly from the

visited ATC Centres (involving controllers from India, Serbia, and Ireland), their training

facilities and various controllers in training (involving Irish and Maltese controllers).

Secondly, responses were received from the controllers involved in the

EUROCONTROL’s Gate to Gate project on real-time simulations (involving controllers

from the Netherlands, Germany, Italy, France, Sweden, Spain, and Slovenia). Finally,

responses were received from controllers on various courses run by

EUROCONTROL’s Institute for Air Navigation Services (IANS), involving controllers from

Belgium, Ireland, Switzerland, the Netherlands, Romania, and Sweden. IANS provides

regular courses to ATC staff from all EUROCONTROL Member States (i.e. 37 European

countries). In spite of the high costs involved, approximately 40 percent of the data were

collected using face-to-face interviews, where controllers had an opportunity to clarify

any doubts before answering questions.

6.6.2 Self-completion survey

The self-completion survey involved electronic distribution of questionnaires by Imperial

College colleagues visiting various ATC Centres and via professional networks and

popular aviation-related internet forums. Countries visited included Tahiti, South Africa,

Tanzania, a number of European countries, Macau, New Zealand, Singapore, and

China. In addition, the Imperial College colleagues exploited professional networks of

controllers to gain more responses. These networks have links to EUROCONTROL

and ATM specialists in various air navigational service providers, and hence resulted in

responses from Croatia, Finland, Switzerland, Macedonia, Moldova, India, and

Germany. Additionally, the Professional Pilots Rumour Network (PPRuNe) forum, an

aviation website dedicated to airline pilots and others in aviation business including air

traffic control staff, was also used for obtaining survey data (see PPRuNe, 2006). The

aims and objectives of the survey and the overall research were posted on this

particular internet forum on two separate occasions to attract controllers worldwide. If

interested in participating in this survey, controllers were advised to contact the

researcher and thus obtain an electronic copy of the questionnaire survey. In spite of

an initially high level of interest, only a few responses were collected using this method

(including Australia and the United Kingdom). Overall, approximately 60 percent of the data

were collected using the self-completion method.

6.6.3 Potential sources of errors

There are two main potential sources of error in the survey. These sources are the

respondent and data pre-processing. In general respondent errors may occur for a

variety of reasons. It was noted for example, that controllers from the same ATC

Centre gave contradicting answers to particular questions. Possible causes for this

include imprecision in the formulation of questions, lack of knowledge on the part of

controllers (on existence of recovery procedures, training, organised exchange of

information), and misinterpretation of questions. The imprecision in the formulation of

questions was addressed by the pilot study and thus should not have played a

significant role in generating respondent errors.

Lack of knowledge on the part of controllers was noted for the questions on the status

of recovery procedures, training, and organised exchange of information within their

ATC Centre. For example, while a group of controllers from an ATC Centre was aware

of the recovery procedures, others stated that these procedures do not exist. These

inconsistent responses were further investigated using the related questions. For

example, if the controller responded that no recovery procedures are defined within

his/her ATC Centre (the first question related to recovery procedures), the subsequent

questions related to recovery procedures were examined (e.g. adequacy,

completeness, currency). The final judgement was based on all answers that were

provided in relation to recovery procedures (and not only the first one).
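As an illustration of how contradicting answers from the same ATC Centre might be flagged for the follow-up check described above, a minimal sketch is given below. The centre names and answers are invented for the example, and the function is a hypothetical helper rather than the procedure actually applied to the survey data.

```python
from collections import defaultdict


def centres_with_conflicts(responses):
    """Return the ATC Centres whose controllers gave differing answers.

    `responses` is an iterable of (centre, answer) pairs for one question,
    e.g. the first question on the existence of recovery procedures.
    """
    answers_by_centre = defaultdict(set)
    for centre, answer in responses:
        answers_by_centre[centre].add(answer)
    return sorted(c for c, answers in answers_by_centre.items() if len(answers) > 1)


# Invented example data: "Centre A" has one 'yes' and one 'no'.
example = [
    ("Centre A", "yes"),
    ("Centre A", "no"),
    ("Centre B", "yes"),
    ("Centre B", "yes"),
    ("Centre C", "no"),
]
flagged = centres_with_conflicts(example)
```

Centres flagged in this way would then be examined using the related questions (e.g. adequacy, completeness, currency) before a final judgement is made.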



Misinterpretation was also noted in the question on the number of equipment failures

experienced annually. In this particular case, the data collection reflected the overall

misinterpretation of the term ‘equipment failure’ and the consequent variation in the

answers. While some controllers reported all equipment failures they experienced

within one year regardless of severity, others reported only major failures classified as

infrequent high severity occurrences.

The possibility of errors arising from pre-processing of the responses was mitigated by

extra care at the data input stage (i.e. double checking of each input). In the case of

multiple response questions or questions returning a range instead of a single value, a

consistent approach was taken. For example, in response to question 4 ‘What is the

average number of ATC equipment failures during one year that you experience?’ the

respondents tended to provide either a single numerical value, range, or a textual

answer. In the case of a range, the midpoint was taken. This method has been

applied consistently with other questions, if necessary. Textual answers have been

transformed into numerical values (e.g. ‘once in two years’ was considered as 0.5 per

year). However, sometimes these textual answers could not be transformed to

numerical values and thus the answer was omitted (e.g. question 5 segment on

frequency and duration of failure was answered ‘minutes’, ‘very frequent’, ‘very often’,

‘rarely’, ‘very rarely’, or ‘once in career’).
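The pre-processing rules described above (single value kept, range replaced by its midpoint, convertible text mapped to a rate, everything else omitted) can be sketched as follows. The function name and the accepted range formats are assumptions made for illustration; the lookup table contains only the one textual example given in the text.

```python
import re

# Only the textual example quoted in the thesis is included here; any
# further entries would be an analyst's judgement.
TEXT_TO_RATE = {"once in two years": 0.5}

# Assumed range formats: "2-4" or "2 to 4".
RANGE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)")


def failures_per_year(answer: str):
    """Convert a raw answer to failures per year, or None if it must be omitted."""
    text = answer.strip().lower()
    if text in TEXT_TO_RATE:
        return TEXT_TO_RATE[text]
    match = RANGE_PATTERN.fullmatch(text)
    if match:
        low, high = float(match[1]), float(match[2])
        return (low + high) / 2  # midpoint of the reported range
    try:
        return float(text)  # single numerical value
    except ValueError:
        return None  # e.g. 'very rarely' cannot be quantified


examples = ["3", "2-4", "5 to 10", "once in two years", "very rarely"]
converted = [failures_per_year(a) for a in examples]
```

Answers that return `None` correspond to the responses that were omitted from the analysis.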

The next section describes the methodology behind the analysis of questionnaire

survey results.

6.7 Methodology for the questionnaire survey data analysis

This section starts with a discussion on the questionnaire data pre-processing issues. It

then proceeds with the analysis of questionnaire survey data organised in three

segments. The first segment deals with the characteristics of survey sample in terms of

number of countries, ATC Centres, and controllers surveyed. This segment also

focuses on the characteristics of controllers by assessing their operational experience

(i.e. number of years in service) and rating (differentiating between Area Control,

Approach Control, and Tower ratings). The second segment of the questionnaire

survey data analysis presents the high-level summaries of responses, i.e. simple

percentage analysis (Figure 6-4). These summaries are organised in seven sub-groups:

six correspond to the six key questions that the questionnaire survey was

designed to answer (see section 6.1), whilst the seventh sub-group presents other

findings captured in the survey (presented in Appendix VI). The final segment of the

questionnaire survey data analysis provides an in-depth investigation of the interaction

between recovery factors previously analysed. The following sections discuss the

results and findings generated using the process in Figure 6-4.

Figure 6-4 The flow chart of questionnaire survey analyses

[Flow chart. Questionnaire survey data; characteristics of the sample (58 ATC Centres, 134 controllers); high-level analyses (experience with equipment failures; factors that influence recovery performance; the most unreliable ATC systems; organised exchange of information on equipment failures; status of recovery procedures; status of training for recovery; other findings reported in Appendix VI); interaction analyses.]

6.7.1 Data pre-processing

The data collected during the survey were subjected to further statistical analysis using

the SPSS statistics package. Each respondent was given a numerical identifier (serial

number) but no identifying information, such as the person’s name, was used. The



choices made in the questionnaire by each respondent were recorded under each

corresponding serial number.

During the process of data pre-processing and analysis, all available responses were

taken into account. A special ‘scoring’ technique was used for questions that required

the ranking of choices (question 6). In this particular case, the controllers were asked to

‘score’ their reliance upon written procedures, situation-specific problem solving, and

other factors during the recovery process. This approach is explained in detail in

section 6.7.3.2.

6.7.2 Characteristics of the sample

A total of 134 questionnaire responses were received from 58 ATC Centres spread

across 34 countries (Table 6-1). According to UN data, this questionnaire survey

covers 17.8 percent of independent countries worldwide.

Table 6-1 Summary of the questionnaire survey sample

Country           ATC Centre            Responses per ATC Centre  Responses per country
Ireland           Shannon               7                         16
                  Dublin                4
                  Cork                  5
Finland           Kemi                  1                         1
Serbia            Belgrade              3                         3
Switzerland       Zurich                8                         16
                  Geneva                8
United Kingdom    Bristol               1                         1
Netherlands       Maastricht            2                         4
                  Nieuw Milligen        1
                  Amsterdam             1
Germany           Karlsruhe             1                         3
                  Langen                1
                  Frankfurt             1
Spain             Seville               5                         5
Norway            Oslo                  5                         8
                  Kirkenes              1
                  Stavanger             1
                  Bodo                  1
Italy             Rome                  2                         7
                  Bologna               1
                  Naples                2
                  Venice                1
                  Milan                 1
France            Paris                 1                         2
                  Nice                  1
Sweden            Stockholm             3                         8
                  Malmo                 3
                  Gothenburg            2
Slovenia          Ljubljana             1                         1
Belgium           Brussels              3                         3
Macedonia         Skopje                1                         1
Croatia           Split                 1                         4
                  Zagreb                1
                  Pula                  1
                  Zadar                 1
Moldova           Chisinau              1                         1
Iceland           Reykjavik             2                         2
Denmark           Copenhagen            3                         3
Portugal          Lisbon                4                         4
South Africa      Johannesburg (FAJS)   2                         2
Tanzania          Dar es Salaam         1                         1
India             Mumbai                3                         7
                  Kolkata               4
Singapore         Singapore             2                         2
Tahiti            Papeete               6                         6
Australia         Melbourne             1                         1
Austria           Vienna                2                         2
Romania           Bucharest             2                         2
Malta             Malta                 2                         3
                  Luqa airport          1
Macau SAR         Macau                 3                         3
Kenya             Nairobi               4                         4
New Zealand       Wellington            1                         5
                  Auckland              2
                  Christchurch          2
China             Hong Kong             1                         1
Malaysia          Subang                2                         2
Total: 34         58                    134                       134
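The totals in Table 6-1 and the coverage figure quoted above can be cross-checked with a short script. The dictionary below is a transcription of the table made for this illustration (place names are given in their usual spelling, and FAJS is kept as it appears in the table); the variable names are not part of the original analysis, which was carried out in SPSS.

```python
# Responses per ATC Centre, transcribed from Table 6-1 for illustration.
TABLE_6_1 = {
    "Ireland": {"Shannon": 7, "Dublin": 4, "Cork": 5},
    "Finland": {"Kemi": 1},
    "Serbia": {"Belgrade": 3},
    "Switzerland": {"Zurich": 8, "Geneva": 8},
    "United Kingdom": {"Bristol": 1},
    "Netherlands": {"Maastricht": 2, "Nieuw Milligen": 1, "Amsterdam": 1},
    "Germany": {"Karlsruhe": 1, "Langen": 1, "Frankfurt": 1},
    "Spain": {"Seville": 5},
    "Norway": {"Oslo": 5, "Kirkenes": 1, "Stavanger": 1, "Bodo": 1},
    "Italy": {"Rome": 2, "Bologna": 1, "Naples": 2, "Venice": 1, "Milan": 1},
    "France": {"Paris": 1, "Nice": 1},
    "Sweden": {"Stockholm": 3, "Malmo": 3, "Gothenburg": 2},
    "Slovenia": {"Ljubljana": 1},
    "Belgium": {"Brussels": 3},
    "Macedonia": {"Skopje": 1},
    "Croatia": {"Split": 1, "Zagreb": 1, "Pula": 1, "Zadar": 1},
    "Moldova": {"Chisinau": 1},
    "Iceland": {"Reykjavik": 2},
    "Denmark": {"Copenhagen": 3},
    "Portugal": {"Lisbon": 4},
    "South Africa": {"FAJS": 2},
    "Tanzania": {"Dar es Salaam": 1},
    "India": {"Mumbai": 3, "Kolkata": 4},
    "Singapore": {"Singapore": 2},
    "Tahiti": {"Papeete": 6},
    "Australia": {"Melbourne": 1},
    "Austria": {"Vienna": 2},
    "Romania": {"Bucharest": 2},
    "Malta": {"Malta": 2, "Luqa airport": 1},
    "Macau SAR": {"Macau": 3},
    "Kenya": {"Nairobi": 4},
    "New Zealand": {"Wellington": 1, "Auckland": 2, "Christchurch": 2},
    "China": {"Hong Kong": 1},
    "Malaysia": {"Subang": 2},
}

UN_INDEPENDENT_COUNTRIES = 191  # United Nations (2006), as cited in section 6.2

n_countries = len(TABLE_6_1)
n_centres = sum(len(centres) for centres in TABLE_6_1.values())
n_responses = sum(sum(centres.values()) for centres in TABLE_6_1.values())
# Centres represented by a single respondent.
n_single = sum(
    1 for centres in TABLE_6_1.values() for count in centres.values() if count == 1
)
coverage_percent = round(100 * n_countries / UN_INDEPENDENT_COUNTRIES, 1)
```

The script reproduces the 34 countries, 58 ATC Centres, and 134 responses reported in the table, as well as the 17.8 percent coverage of independent countries.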

Section 6.2 defined the sampling methodology to correspond to the distribution of

global air traffic per region for the year 2003, taking into account the predicted growth

and estimates to the year 2023 (Airbus, 2004; Air Transport Action Group, 2005).

Assuming a similar distribution of traffic for the period of the survey (2005 and 2006)

and predicted changes in the distribution of future air traffic, the questionnaire sample

lacks input from two key markets, namely North America and the Middle East

(Figure 6-5).


Figure 6-5 Distribution of questionnaire responses per region

[Bar chart. Horizontal axis: Region (Africa, Latin America and Caribbean, Asia and Pacific, Europe, North America, Middle East). Vertical axis: Percentage (0 to 80). Data labels shown: 20, 75, 5.]
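The regional shares can be approximated from the per-country totals in Table 6-1. The sketch below assumes a particular assignment of countries to regions (for example, Tahiti, Australia, and New Zealand are counted under Asia and Pacific); this grouping is an assumption made for illustration and is not stated explicitly in the source.

```python
# Responses per region, summed from Table 6-1 under the assumed grouping.
REGION_RESPONSES = {
    "Africa": 7,             # South Africa, Tanzania, Kenya
    "Latin America and Caribbean": 0,
    "Asia and Pacific": 27,  # India, Singapore, Tahiti, Australia, Macau SAR,
                             # New Zealand, China, Malaysia
    "Europe": 100,           # all remaining countries in Table 6-1
    "North America": 0,
    "Middle East": 0,
}

total_responses = sum(REGION_RESPONSES.values())
region_share_percent = {
    region: round(100 * count / total_responses)
    for region, count in REGION_RESPONSES.items()
}
```

Under this grouping the rounded shares are 75 percent for Europe, 20 percent for Asia and Pacific, and 5 percent for Africa, which is consistent with the five percent of responses from the African continent noted in the text.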

However, looking back at the characteristics of the population surveyed, the sample

still manages to capture the diverse levels of traffic and airspace complexity, ATC

system automation, and controllers with a range of operational experience (i.e. years in

service, rating). For example, in the European region the responses from Paris,

Frankfurt, Amsterdam, Zurich, Geneva, and Maastricht represent the input from some

of the busiest European ATC Centres. Likewise from Asia, the responses from

Mumbai, Hong Kong, and Singapore represent some of the busiest ATC Centres on

the continent as well as those that have experienced considerable growth in recent

years. Finally, the sample also includes ATC Centres with technically advanced

systems, e.g. Malmo ACC in Sweden, Maastricht ACC in the Netherlands, Shannon ATC

in Ireland, and the Oceanic Control Centre in Auckland, New Zealand.

Although only five percent of responses were received from the African continent, the

ATC Centres sampled were considered carefully. Johannesburg and Nairobi airports

represent the leading airports in Africa for both passengers and cargo (Air Transport

Action Group, 2005). Both regions are experiencing an increase in passenger

movement mostly as a result of growth in tourism. Failure of ATC equipment and the

recovery response of controllers are of considerable importance in such busy ATC

Centres, more so than in other ATC Centres in Africa with considerably less traffic.

Given the difficulties encountered in accessing ATC Centres and controllers worldwide

(e.g. security, logistics, related costs) and the characteristics of the population

surveyed, the obtained sample can be considered as representative of the population.

The next section assesses the adequacy of sampling achieved within each ATC

Centre.


6.7.2.1 Sampling per ATC Centre

Although 27 ATC Centres had only one response each, analysis of these ATC

Centres shows that their characteristics do not differ from the characteristics of the

remaining sample. For example, these ATC Centres include some of the busiest ATC

Centres (e.g. Frankfurt, Paris, Hong Kong) as well as those with low traffic and

airspace complexity (Kemi-Finland, Bristol-UK, Bologna-Italy, Ljubljana-Slovenia,

Zagreb-Croatia). They also include ATC Centres with technically advanced ATC

systems (e.g. Frankfurt, Amsterdam, Karlsruhe, Stavanger, and Melbourne). Finally, the

characteristics of controllers include all levels of operational experience (i.e. ranging

from 3 to 39 years in service) and ratings. In short, these 27 ATC Centres capture the

characteristics of the target population and as such will be included in the further data

analyses.

6.7.2.2 Sampling of air traffic controllers

The questionnaire survey captured interesting information related to the operational

experience of controllers, namely years of experience, country of residence, and ATC

facility location (i.e. city or airport). The survey data show that on average controllers

have more than 13 years of operational experience (i.e. length of service), ranging from

1 to 39 years. More than 77 percent of the controllers surveyed have up to 20 years of

experience. Taking into account the length of service captured in this survey, it is split

into four categories: 1-10, 11-20, 21-30, and 31-40 years (Figure 6-6). The sample is

reasonably representative of the population as all categories are represented. There

seem to be fewer respondents with over 30 years of experience in the sample

collected. However, this is expected as the majority of controllers with more than 30

years in service tend to move to operational support roles, including training,

instructing, and management.
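The grouping of length of service into the four categories can be sketched as follows. This is a minimal illustration in Python: the category bounds come from the text, whereas the function names and the `sample_years` list are hypothetical and do not reproduce the survey responses.

```python
from collections import Counter

# Category bounds taken from the text: 1-10, 11-20, 21-30, and 31-40 years.
BINS = [(1, 10), (11, 20), (21, 30), (31, 40)]


def experience_category(years: int) -> str:
    """Return the label of the length-of-service category for `years`."""
    for low, high in BINS:
        if low <= years <= high:
            return f"{low}-{high}"
    raise ValueError(f"years of service outside 1-40: {years}")


def category_shares(years_list):
    """Return the percentage of respondents in each category."""
    counts = Counter(experience_category(y) for y in years_list)
    total = len(years_list)
    return {
        f"{low}-{high}": 100 * counts[f"{low}-{high}"] / total
        for low, high in BINS
    }


# Hypothetical responses for illustration only (NOT the survey data).
sample_years = [1, 3, 8, 12, 13, 17, 20, 24, 29, 39]
shares = category_shares(sample_years)
```

A category with no respondents would appear with a share of zero, which makes an unrepresented category easy to spot.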


Figure 6-6 Distribution of operational experience

Furthermore, Figure 6-7 presents the distribution of the ratings of the controllers who

participated in the survey. In general, most controllers have ACC ratings. As a result,

data analyses may be biased towards the experience within the ACC environment

which tends to be better staffed and with more access to advanced equipment/tools

(e.g. multiple radar sites feed the radar coverage instead of a single radar site as in APP

and TWR control, and investment in the more automated systems).

[Bar chart; x-axis: Rating (ACC & APP & TWR, ACC & APP, ACC & TWR, APP & TWR, ACC, APP, TWR); y-axis: Percentage; data labels: 3.73, 2.24, 31.34, 15.67, 9.7, 26.12, 10.45]

Figure 6-7 Distribution of controllers’ ratings

6.7.3 High-level analyses

This section presents high-level results from the simple percentage analyses of the

entire dataset. These summaries are organised into seven sub-groups, corresponding

to the six key questions that the survey was designed to answer (defined in section 6.1)


and concluding with other findings on controller recovery (captured in question 5).

Therefore, the relevant sub-groups are: experience with equipment failures in the ATC

Centre, factors that influence the recovery performance, the most unreliable ATC

systems/tools, organised exchange of information on equipment failures, status and

quality of recovery procedures, status and quality of training for recovery, and other

findings. Each of the sub-groups is discussed below.

6.7.3.1 Experience with equipment failures (Q1)

In the sample obtained, 94.8 percent of controllers had experienced some kind of ATC

equipment failure in their career. Additionally, this group of controllers experienced on

average 17 equipment failures annually, ranging from less than 1 per year up to 600,

as reported by one ATC Centre. This dispersion of the results reflects the wide

variation in the interpretation of equipment failures. Some controllers interpreted the

question on equipment failures in terms of only ‘major’ (more severe) failures. Their

answers ranged from less than one (e.g. once in two years, once in five years, once in

a career) to one failure annually (34.6 percent of responses). Other controllers reported

the total number of failures experienced annually regardless of their level of severity,

with responses ranging from dozens to hundreds. In short, the vast majority of

controllers surveyed have experienced equipment failures.

6.7.3.2 Factors that influence controller recovery performance (Q2)

Controllers were asked to rate how much they relied upon written procedures,

situation-specific strategies (i.e. context), and other factors (e.g. past experience) in

handling equipment failures. The ratings ranged from one to five, where one stands for

‘very much’, two for ‘much’, three for ‘moderate’, four for ‘minimal’ and five for ‘not at

all’.

The results show that more than 45 percent of the controllers surveyed rely on written

procedures in the event of an equipment failure at the levels of either ‘much’ or ‘very

much’ (see Figure 6-8). These controllers have on average more than 13 years of

experience, they operate in ATC Centres with recovery procedures (96.4 percent of

controllers who rated written procedures ‘much’ or ‘very much’) and recovery training

schemes (64.3 percent of controllers who rated written procedures ‘much’ or ‘very much’).


[Bar chart; x-axis: Written procedures (Not at all, Minimal, Moderately, Much, Very much); y-axis: Frequency; data labels: 3.25%, 13.01%, 37.4%, 22.76%, 23.58%]

Figure 6-8 Controllers’ reliance on written procedures throughout the recovery process

When it comes to situation-specific problem solving, 63.48 percent of controllers rated

this factor at the levels of either ‘much’ or ‘very much’ (see Figure 6-9). Similar to the

previous factor, the operational experience of controllers who rated this factor highest

is on average more than 13 years, they operate in ATC Centres with recovery

procedures (94.5 percent of controllers who rated situation-specific problem solving

‘much’ or ‘very much’) and recovery training schemes (63 percent of controllers who

rated situation-specific problem solving ‘much’ or ‘very much’). The only difference

observed with the previous group of controllers is that no controllers from the African

region rated situation-specific problem solving highly. European controllers tend to rely

much more on situation-specific problem solving (69.3 percent of responses captured

from European controllers) compared to their reliance on written procedures (42.7

percent).


[Bar chart; x-axis: Situation-specific problem solving; y-axis: Frequency; data labels: 1.74%, 10.43%, 24.35%, 35.65%, 27.83%]

Figure 6-9 Controllers’ reliance on situation-specific problem solving throughout the recovery process


Finally, 64.08 percent of controllers rated other factors (e.g. past experience) at the

level of either ‘much’ or ‘very much’ (see Figure 6-10). Similar to the previous factors,

the operational experience of controllers who rated this factor highest is on average

more than 13 years, they operate in ATC Centres with recovery procedures (90.8

percent of controllers who rated other factors ‘much’ or ‘very much’) and recovery

training schemes (58.5 percent of controllers who rated other factors ‘much’ or ‘very

much’). European controllers rely most on other factors (e.g. past experience) when

recovering from equipment failures (69.6 percent of responses captured from European

controllers) compared to Asian controllers (42.1 percent of responses captured from

Asian controllers). The sample of African controllers is too small for any comparison.


[Bar chart; x-axis: Past experience; y-axis: Frequency; data labels: 2.91%, 3.88%, 29.13%, 31.07%, 33.01%]

Figure 6-10 Controllers’ reliance on other factors (e.g. past experience) throughout the recovery process

Figures 6-8 to 6-10 and frequency analysis show that controllers mostly rely upon other

factors (e.g. past experience) when dealing with equipment failures. This is followed by

situation-specific problem solving and finally written procedures. After investigation of

factors that affect controller recovery, the next section focuses on the survey objective

and the assessment of the most unreliable ATC system/tool.
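The ranking above follows from adding the ‘much’ and ‘very much’ shares for each factor. The sketch below, in Python, uses the percentages shown in Figures 6-8 to 6-10. The assignment of each value to a rating category assumes that the data labels follow the axis order, and the names `reliance` and `top_two_share` are illustrative only.

```python
# Rating scale from the questionnaire, ordered from least to most reliance.
RATINGS = ["not at all", "minimal", "moderate", "much", "very much"]

# Percentages as shown in Figures 6-8 to 6-10.
# ASSUMPTION: values are listed in the same order as RATINGS.
reliance = {
    "written procedures": [3.25, 13.01, 37.40, 22.76, 23.58],
    "situation-specific problem solving": [1.74, 10.43, 24.35, 35.65, 27.83],
    "other factors (e.g. past experience)": [2.91, 3.88, 29.13, 31.07, 33.01],
}


def top_two_share(percentages):
    """Return the combined share of 'much' and 'very much' ratings."""
    by_rating = dict(zip(RATINGS, percentages))
    return round(by_rating["much"] + by_rating["very much"], 2)


shares = {factor: top_two_share(p) for factor, p in reliance.items()}

# Factors ordered from most to least relied upon.
ranking = sorted(shares, key=shares.get, reverse=True)
```

The combined shares do not depend on which of the two highest bars is ‘much’ and which is ‘very much’, so the ranking is unaffected by that part of the assumption.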

6.7.3.3 The most unreliable ATC systems/tools (Q3)

The data used for the analysis of the most unreliable ATC equipment are based on two

particular questions, 5 and 9. Question 5 consisted of examples of equipment failures

that severely impacted on the controller’s work. Question 9 asked controllers to list the

three most unreliable ATC systems/subsystems they have experienced. The data

obtained from both questions were collated and pre-processed to remove any duplicate


answers. This was necessary as controllers tended to give similar responses to both

questions.

The results of the analysis of questionnaire responses from 34 countries were found to

be similar to those obtained from the analysis of operational failure reports, presented

in Chapter 4. The questionnaire survey shows that the three most affected ATC

functionalities are: communication (37.2 percent of all examples provided), data

processing (24.6 percent), and surveillance (23 percent) (Figure 6-11). More precisely,

the following five equipment types are affected most:

• air-ground communication (12.03 percent of all examples provided);

• primary surveillance radar (9.1 percent);

• flight data processing system (7.75 percent);

• communication panel (7.49 percent); and

• ground to ground communication (6.68 percent).

Figure 6-11 Distribution of affected ATC functionalities as reported in the questionnaire survey

Table 6-2 establishes the link between the most unreliable ATC functionalities and

existing recovery procedures, as reported by 134 controllers from 34 countries

representing various regions of the world. The link is established based on responses

to questions 5, 9, 10, and 11. In addition, the analysis was conducted at the country

level rather than ATC Centre level to avoid direct reference to sensitive information

specific to ATC Centres. It should be noted that because of this, inaccuracies are

possible only for the cases when the controllers did not have a full awareness of the

availability of recovery procedures in their ATC Centres.


Table 6-2 Mapping between most unreliable ATC functionalities and existing recovery procedures for the countries sampled

Country Most unreliable ATC functionalities

Existing recovery procedure

Ireland

Communication Frequency failure, telephone failure Navigation Failure of navigational aids Surveillance Radar failure (procedural/non-radar control) Data processing Strip printer failure (emergency strip printing) Pointing/input devices

Input device failure

Power outages, procedures for all failure types

Finland Communication Surveillance Data processing

Serbia

Communication Frequency failure, telephone failure Surveillance

Data processing Flight data processing system (FDPS) failure, radar data processing system (RDPS) failure

Switzerland

Communication Frequency failure, telephone failure Navigation Surveillance Radar failure, visualisation system (radar display) failure Data processing FDPS failure Pointing/input devices

Power supply failure

United Kingdom

Surveillance Procedures for all failure types

Netherlands

Communication Frequency failure

Surveillance Secondary surveillance radar (SSR) failure, radar fallback system failure, failure of the working position (radar display)

Data processing FDPS failure, RDPS failure Pointing/input devices

Total system failure (in various gradations)

Germany

Communication Surveillance Radar failure Data processing Total system failure

Spain

Communication Frequency failure Surveillance Total radar failure Data processing Fire contingencies

Norway

Communication Frequency failure, on-line data interchange (OLDI) link failure, communication panel failure, telephone failure, headset failure, intercom failure

Surveillance Radar failure, failure of the radar display Data processing FDPS failure Pointing/input devices

Italy

Communication Frequency failure Navigation Runway/taxiway lights failure Surveillance Radar failure Data processing

France

Communication Frequency failure, telephone failure Surveillance Radar failure Data processing FDPS failure, RDPS failure

Power outage, air conditioning failure, fire evacuation, meteorological equipment failure, failure of navigation aids

Sweden

Communication Frequency failure, telephone failure Surveillance Radar failures, surface movement radar failure Data processing Pointing/input devices

Safety nets

Procedures for most failure types, runway/taxiway lighting system failure, instrument landing system (ILS) failure

Slovenia Communication Frequency failure, telephone failure Data processing FDPS failure, RDPS failure Radar failure

Belgium Communication Frequency failure Surveillance Radar failure, radar fallback failure

Macedonia

Communication Frequency failure Data processing Pointing/input devices

Radar failure

Croatia

Communication Frequency failure, telephone failure Surveillance Radar failure Data processing Power outage

Moldova Radar failure

Iceland Communication Surveillance Data processing FDPS failure

Denmark Communication Frequency failure, telephone failure Data processing Radar failure

Portugal

Communication Frequency failure, telephone failure, voice switching and communication system (VSCS) failure

Navigation Surveillance Radar failure, radar display failure Data processing Strip printer failure

South Africa Communication Radar failure

Tanzania Frequency failure, telephone failure, FDPS failure, power outage

India

Communication Telephone failure, intercom failure

Navigation Failure of navigation equipment, instrument landing system (ILS) failure

Surveillance Radar failure Data processing FDPS failure Pointing/input devices

Singapore Communication Frequency failure Surveillance Radar failures, failure of radar display

Tahiti

Communication Frequency failure, failure of satellite communication Surveillance Data processing Safety nets

Navigational aids failure, tsunami alert, aircraft diverting due to terrorist action

Australia Communication Surveillance

Austria Surveillance Data processing FDPS failure, RDPS failure, failure of strip printer


Pointing device failure, failure of touch input display (TID), frequency failure

Romania Communication Surveillance Procedures for all failure types

Malta

Communication Surveillance Radar failure Data processing Pointing/input devices

Power supply

Macau Special Administrative Region

Communication Frequency failure Navigation Navigation aids failure Data processing

Procedures for all failure types, radar failure, SSR failure

Kenya

Communication Frequency failure, telephone failure Navigation Surveillance Data processing Strip printer failure

New Zealand

Communication Frequency failure, telephone failure Surveillance Radar failure, radar screen failure Data processing FDPS failure, RDPS failure Safety nets

Partial and total failure of all ATC equipment, evacuation of ATC centre, mouse/keyboard failure, power outage

China Surveillance Radar failure FDPS failure, frequency failure

Malaysia

Communication Frequency failure Surveillance Data processing Safety nets

The instances in which identified failures are not supported by existing recovery

procedures are highlighted in grey. In these cases, controllers experienced ATC

equipment failures for which recovery procedures were not available in their ATC

Centre. On the other hand, the instances in which sampled controllers have not yet

experienced equipment failures, for which procedures exist, are highlighted in yellow

and separated as the last row for each country. As an example, if the communication

function was affected specifically by frequency failure, the mapping is not established

(coloured grey) if the recovery procedure did not exist for this particular failure type. In

several cases controllers reported that their ATC Centre has procedures for all failure

types. Clearly, it is not possible to cover every failure type individually; it is only possible to design generic

procedures or guidelines to follow in the case of equipment failure.
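The mapping logic described above amounts to a comparison of two sets per country: the failure types experienced and the failure types for which a recovery procedure exists. A minimal sketch in Python is given below. The function name `classify` and the example sets are hypothetical and do not correspond to any row of Table 6-2.

```python
def classify(experienced: set, procedures: set) -> dict:
    """Compare experienced failure types with available recovery procedures."""
    return {
        # Failure experienced and a recovery procedure exists.
        "supported": sorted(experienced & procedures),
        # Failure experienced but no procedure exists (highlighted in grey).
        "grey": sorted(experienced - procedures),
        # Procedure exists but the failure has not yet been experienced
        # (highlighted in yellow).
        "yellow": sorted(procedures - experienced),
    }


# Hypothetical example for illustration only.
example = classify(
    experienced={"frequency failure", "radar failure", "FDPS failure"},
    procedures={"frequency failure", "power outage"},
)
```

A generic response such as ‘procedures for all failure types’ cannot be matched in this way and would need to be handled separately.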

It can be concluded that inadequate mapping between recovery procedures and

equipment failures experienced by controllers occurred in many cases. The most

severe cases are those in which countries provide at most one type of recovery


procedure. This was identified in several European countries (i.e. Finland, Macedonia,

Iceland, and Malta), in two African countries (i.e. South Africa and Kenya), and two

Asian/Pacific countries (i.e. Tahiti and Malaysia). The most neglected ATC functionality

was found to be data processing, followed by surveillance and communication. The

paradox is that the qualitative equipment failure impact assessment tool (Chapter 4)

identified exactly these three ATC functionalities as the most challenging to controller

recovery.

6.7.3.4 Organised exchange of information on equipment failures (Q4)

Of the controllers surveyed, 40.3 percent reported that their ATC Centres have an

organised exchange of information on equipment failures between colleagues, 49.3

percent reported a lack of this exchange of experience, and 10.4 percent did not

answer this question.

Contradictory responses were obtained from 14 ATC Centres and are further

investigated by responses given to the subsequent question, i.e. whether the organised

exchange of experience is supported by management as a good working practice.

From the ATC Centres that have exchange of experience, 76 percent have formal

processes approved by management as opposed to the practice based on ‘word of

mouth’ that reaches only a small portion of controllers. The question was intended to

capture initiatives by management to provide means to share experience on equipment

failures in an organised manner. This may be achieved using different methods, such

as seminars, company newsletters, safety bulletins, memorandums, and workshops. In

these ways the lessons learnt are disseminated not only between the controllers

directly experiencing the effects of the failure, but within the entire ATC Centre and

often within the same country.

Based on this additional assessment, the following countries and ATC Centres have neither formal nor

informal processes for exchange of experience on equipment failures: Italy, Ireland,

Croatia, India, Slovenia, Maastricht ATC Centre (as opposed to Amsterdam Centre),

Switzerland, Macau SAR, and Kenya.

The data indicate that there is room for improvement. There is a clear need for the

implementation of formal processes for exchange of experience on equipment failures

including failure modes and recovery processes. This should form part of a wider safety

culture within ATC Centres which is the responsibility of management. The past has

proven this type of indirect training to have a beneficial safety impact in a similar way to


regular recurrent training. The example discussed in Chapter 5 mentions an incident

where an A300 was struck on the left wing by a surface-to-air missile, resulting in a

loss of all flight controls. Reacting rapidly, the captain recalled a television documentary

on a DC-10 crash at Sioux City (Iowa) and the thrust change technique employed by

the captain and crew of the DC-10 to control their aircraft. Although the A300 crew had

never practised this technique before, they quickly gained control despite the extreme

stress of the situation (IFALPA, 2005).

6.7.3.5 Status and quality of recovery procedures (Q5)

A section of the questionnaire consisting of 11 questions (from 10th to 20th question)

was dedicated to the assessment of recovery procedures within each ATC Centre. The

first question was designed to immediately filter out those ATC Centres without any

written procedures in place. In this case, the controller would skip the rest of this

section and proceed with the rest of the questionnaire. In cases where recovery

procedures exist, the remaining ten questions were designed to assess the quality of

those procedures. These questions focused on the completeness of the recovery

procedure, the level of currency, clarity, realism or feasibility, accessibility, and

compatibility with other procedures. In addition, controllers were given the opportunity

to comment on any event for which there was an inadequate application of recovery

procedures in their working experience.

The analysis of the questionnaire responses highlighted some inconsistencies (marked

with ‘?’ in Table 6-3). In these cases, the controllers from the same ATC Centre gave

opposite responses to the questions on the existence of recovery procedures, recovery

training, and/or recurrent training. These are further investigated using the responses

to the subsequent questions related to recovery procedure (11th to 20th question),

recovery training (25th to 28th question), and recurrent training (23rd and 24th question).

In this section, further investigation regarding the existence of recovery procedures is

conducted for Shannon, Cork, Brussels, and Nairobi ATC Centres (Table 6-3) using the

answers provided from 11th to 20th question. Although controllers from these ATC

Centres reported a lack of recovery procedures in the 10th question, their subsequent

answers revealed that these procedures do exist (at least for some failure types).
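The consolidation of individual answers into one status per ATC Centre, with contradictory answers flagged as ‘?’, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `consolidate` and the centres ‘Centre A’ to ‘Centre C’ are hypothetical, and unanswered questions are assumed to have been excluded beforehand.

```python
from collections import defaultdict


def consolidate(responses):
    """Return one status per centre: 'Yes', 'No', or '?' if answers conflict.

    `responses` is an iterable of (centre, answer) pairs, where each answer
    is 'Yes' or 'No'.
    """
    answers = defaultdict(set)
    for centre, answer in responses:
        answers[centre].add(answer)
    return {
        centre: next(iter(given)) if len(given) == 1 else "?"
        for centre, given in answers.items()
    }


# Hypothetical responses for illustration only (NOT the survey data).
responses = [
    ("Centre A", "Yes"), ("Centre A", "Yes"),
    ("Centre B", "Yes"), ("Centre B", "No"),
    ("Centre C", "No"),
]
status = consolidate(responses)

# Centres that require further investigation using subsequent questions.
flagged = [centre for centre, s in status.items() if s == "?"]
```

Centres with a single response can never be flagged in this way, so their status rests on one controller's awareness of the procedures.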


Table 6-3 Existence of recovery procedures, recovery training, and recurrent training as reported in the questionnaire survey

Country | ATC Centre | Existence of recovery procedure | Existence of training for equipment failures | Existence of recurrent training
Ireland | Shannon | ? | Yes | ?
Ireland | Dublin | Yes | No | ?
Ireland | Cork | ? | ? | ?
Finland | Kemi | No | Yes | Yes
Serbia | Belgrade | Yes | No | No
Switzerland | Zurich | Yes | Yes | ?
Switzerland | Geneva | Yes | Yes | ?
United Kingdom | Bristol | Yes | Yes | No
Netherlands | Maastricht | Yes | ? | Yes
Netherlands | Nieuw Milligen | Yes | Yes | No
Netherlands | Amsterdam | Yes | Yes | Yes
Germany | Karlsruhe | Yes | Yes | No
Germany | Langen | Yes | Yes | No
Germany | Frankfurt | Yes | Yes | Yes
Spain | Seville | Yes | ? | No
Norway | Oslo | Yes | Yes | Yes
Norway | Kirkenes | Yes | Yes | No
Norway | Stavanger | Yes | No | Yes
Norway | Bodo | Yes | Yes | Yes
Italy | Rome | Yes | No | ?
Italy | Bologna | Yes | No | No
Italy | Naples | Yes | No | No
Italy | Venice | Yes | Yes | No
Italy | Milan | Yes | No | No
France | Paris | Yes | Yes | No
France | Nice | Yes | No | No
Sweden | Stockholm | Yes | No | No
Sweden | Malmo | Yes | Yes | Yes
Sweden | Gothenburg | Yes | Yes | Yes
Slovenia | Ljubljana | Yes | Yes | Yes
Belgium | Brussels | ? | No | No
Macedonia | Skopje | Yes | No | No
Croatia | Split | Yes | No | Yes
Croatia | Zagreb | Yes | No | No
Croatia | Pula | No | No | Missing data
Croatia | Zadar | No | No | Missing data
Moldova | Chisinau | Yes | Yes | Yes
Iceland | Reykjavik | Yes | No | ?
Denmark | Copenhagen | Yes | Yes | Yes
Portugal | Lisbon | Yes | ? | ?
South Africa | FAJS | Yes | Yes | Yes
Tanzania | Dar es Salaam | Yes | Yes | No
India | Mumbai | Yes | ? | Yes
India | Kolkata | Yes | ? | No
Singapore | Singapore | Yes | Yes | Yes
Tahiti | Papeete | Yes | ? | ?
Australia | Melbourne | Yes | No | No
Austria | Vienna | Yes | No | Yes
Romania | Bucharest | Yes | Yes | Yes
Malta | Malta | No | Yes | No
Malta | Luqa airport | Yes | Yes | Yes
Macau SAR | Macau | Yes | ? | ?
Kenya | Nairobi | ? | Yes | No
New Zealand | Wellington | Yes | Yes | No
New Zealand | Auckland | Yes | Yes | Yes
New Zealand | Christchurch | Yes | ? | Yes
China | Hong Kong | Yes | Yes | No
Malaysia | Subang | Yes | ? | No

Table 6-3 shows that 93.1 percent of sampled ATC Centres do have some form of

recovery procedure in place (i.e. 54 ATC Centres). The types of equipment failures

mostly covered by recovery procedures in sampled ATC Centres are:

• radar failure (reported by 40.2 percent of controllers surveyed);

• failure of communication function: radio telephony, ground to ground communication, voice switching and communication system panel (reported by 43.3 percent of controllers surveyed); and

• flight data processing system failure (reported by 12.69 percent of controllers surveyed).⁴

74 percent of controllers reported that these recovery procedures are kept up-to-date

and reflect the changes in hardware and software occurring in the ATC Centre.

Similarly, 72 percent of controllers rated available recovery procedures as

comprehensive, while only 55 percent rated them as complete. The remaining 45

percent of controllers surveyed rated available recovery procedures as incomplete (i.e.

missing recovery steps necessary to re-establish a safe ATC service). When asked

which types of recovery procedures should be added, the controllers mostly

emphasised the requirement for recovery procedures from radar failure,

communication systems failure, the need for back-up systems, and procedures for

handling outages at ATC Centre level. Furthermore, 88 percent of controllers rated

available recovery procedures as clear and understandable, while 72 percent rated

them as realistic and feasible to perform.

69 percent of controllers surveyed reported that recovery procedures documentation is

easily accessible, i.e. it is placed in close proximity to controller working positions.

4 The discussion presented in Chapter 5 showed that ICAO provides recovery procedures for

the communication and surveillance functionalities but not for the data processing functionality.


Finally, 77 percent of controllers reported that available recovery procedures are linked

or harmonised to other procedures specified within the Manual of Air Traffic Services

(MATS), e.g. on suite allocation of tasks (separation of responsibilities between

executive and planner controller), and duties of the staff such as the approach

controller, the ground controller, or the watch manager.

From the survey data and subsequent analyses, it can be concluded that the majority of

sampled ATC Centres have some form of recovery procedures. The majority of

controllers reported that these procedures are up-to-date, comprehensive, easily

accessible, and compatible with other procedures. Moreover, controllers emphasise

the need for procedures on radar and communication failures.

6.7.3.5.1 Other findings regarding the recovery procedures

In addition to the findings in the previous section, the questionnaire’s narrative section

highlighted interesting safety-relevant issues regarding recovery procedures. These are

individual comments rather than findings representative of the entire sample. The

reported issues are categorised in three groups, namely equipment specific, teamwork

specific, and generic recovery related issues. These are discussed in the following

paragraphs.

The equipment related issues highlighted major problems with the flight data

processing system not covered in the operational manuals. In addition, controllers

reported a lack of back-up facilities. One example indicated that during radio

communication system failure, a particular ATC Centre had only ten emergency radio

devices for the operational room with a 20-seat configuration.

On teamwork related issues, the controllers mostly reported inadequate familiarisation

with contingency procedures on the part of technical staff and controllers in

neighbouring sectors. In general, the controllers highlighted the important role of

teamwork and the need for an experienced planning controller in the event of

equipment failure. Another example drew attention to the unavailability of technical staff

during night shifts to immediately provide assistance in the case of equipment failure.

In short, controllers feel that teamwork is important in dealing with failures and that

Team Resource Management (TRM) training, aimed at enhancing teamwork efficiency,

should be mandatory for all ATC Centres.


Finally, many individual recovery related issues, such as context, procedures, and

working practice, are also highlighted in the questionnaire’s narrative part. These are

as follows:

• Situation-specific problem solving plays a major role as all equipment failures occur within a specific context (e.g. bad weather, frequency jamming, high/low traffic levels);

• There is a need for an approach to recovery procedures similar to that available to pilots. In other words, a comprehensive manual with all possible failures and corresponding recovery steps is needed during controller training. For the operational environment, it would be necessary to design an abbreviated version of the contingency manual available at each controller working position (e.g. aide-memoire in the form of a check-list, see Appendix III); and

• Accurate and efficient strip marking is seen as the most reliable recovery tool in the case of radar or flight data processing failure.

6.7.3.6 Status and quality of training for recovery (Q6)

A section of the questionnaire consisting of eight questions (from 21st to 28th question)

was dedicated to the assessment of training in recovery from equipment failures within

each ATC Centre. The first question was designed to immediately filter out those

Centres without training schemes. In this case, the controller would skip the remainder

of this section and proceed with the final part of the questionnaire. In the case of the

existence of a recovery training scheme, the remaining seven questions were designed

to assess its quality by extracting information on the existence of recurrent training, its

frequency, content, and compatibility with other types of training. The final section of

the questionnaire provided the opportunity for controllers to comment on other issues

of relevance to training.

The analysis of the collected data firstly revealed inconsistencies in the responses to

questions on training (Table 6-3). The reason for this may be that some controllers

assumed their initial training, e.g. initial radar control training, as training for recovery.

Other controllers may have considered only separate training for emergency situations

and whether it involved some type of equipment failure.

Thirty ATC Centres (51.7 percent) have training for recovery from equipment failures, 18

ATC Centres (31 percent) do not, while data for 10 ATC Centres (17.2 percent) are

inconsistent (i.e. marked with ‘?’ in Table 6-3). In these cases, the controllers from the


same ATC Centre gave opposite responses to the questions on existence of recovery

training. All these inconsistencies are further investigated using the subsequent

questions related to recovery training (i.e. 25th to 28th question). Although controllers

from these ATC Centres reported contradictory responses on existence of the recovery

training (i.e. 21st question), their answers to subsequent training-related questions did

not reveal any further information. Therefore, a conservative approach has been taken

and these 10 ATC Centres are considered not to have recovery training in place.
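The effect of this conservative approach on the reported shares can be worked through as follows. The counts are those given in the text for the 58 sampled ATC Centres; the variable and function names in this Python sketch are illustrative only.

```python
TOTAL_CENTRES = 58

# Counts reported in the text for the existence of recovery training.
with_training = 30
without_training = 18
inconsistent = 10

# The three groups should account for every sampled ATC Centre.
assert with_training + without_training + inconsistent == TOTAL_CENTRES


def share(count: int) -> float:
    """Return `count` as a percentage of all sampled ATC Centres."""
    return round(100 * count / TOTAL_CENTRES, 1)


# Conservative approach: inconsistent centres are treated as having
# no recovery training in place.
conservative_without = without_training + inconsistent

share_with = share(with_training)
share_without_conservative = share(conservative_without)
```

Under this assumption, the centres split into those with confirmed recovery training and all others, so the share with training is a lower bound.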

In the case of recurrent training, the analysis shows that only 36.2 percent of the whole

sample of ATC Centres have recurrent training, 43 percent do not, while the rest of the

data is either inconsistent or missing. Recurrent training is provided once a year in 25

ATC Centres and bi-annually in three ATC Centres (Oslo-Norway, Bucharest-Romania,

Auckland-New Zealand). In addition, Geneva and Melbourne ATC Centres provide

recurrent training three times per year, while Frankfurt ATC Centre provides recurrent

training 20 times per year. In the latter, a contingency system is used every weekend to

train controllers.

Further analysis of the ATC Centres with recurrent training frequency higher than once

a year shows that all have recovery procedures in place, while the majority (i.e. 64

percent) have an organised exchange of information on equipment failures. The

Auckland ATC Centre emphasised that recovery performance was difficult before the

introduction of clear and easy to follow procedures. Moreover, this ATC Centre

highlighted that operations impact on recovery training as the recent failure types are

included in the recurrent training. Although the Oslo ATC Centre has recovery

procedures, its controllers report the need for more comprehensive and easily available

procedures (e.g. checklist type procedures on each console). These controllers

expressed a need to step away from increased dependency on experience when

handling equipment failures.

From the subset of controllers who have recurrent training once a year, 55 percent

believe that this is adequate, while the rest express the need for higher frequency in

order to build competency in handling unexpected equipment failures. When asked if

the training covers all important equipment failures, the majority of controllers (i.e. 63

percent) answered negatively. The most frequent issues mentioned to be added to the

current training syllabus are:

• complete radar failure simulated in a comprehensive and realistic way;

• total power failure;

• facility evacuation;

• team resource management (TRM);

• different types of aircraft problems (e.g. communication failure, engine failure, landing gear problem);

• hot standby procedures (system running in the background ready for immediate use); and

• radar bypass (radar information is presented directly at the radar display without having been processed, resulting in the presentation of uncorrelated tracks only).

61 percent of controllers believe that the training methods utilised in their ATC Centres

are suitable, or more precisely, realistic and varied. Furthermore, according to the

responses from 63 percent of controllers surveyed, the recovery training is compatible

with (i.e. linked to) other training schemes. In general, it is essential to harmonise recovery

training within the overall training syllabus. One option is to include recovery training

within each training course, such as ab-initio training, conversion course, continuity or

recurrent training, training for unusual situations, and TRM training. The other option is

to provide separate recovery training sessions on a regular basis. Regardless of the

approach, ATC management has to assure an inclusive, regular, and consistent

approach in training for recovery to its entire population of controllers.

From the survey data and subsequent analyses, it can be concluded that the majority

of the ATC Centres surveyed have some form of recovery training although not

necessarily provided consistently throughout the Centre. The situation with recurrent

training is worse as in the majority of cases, this type of training is not provided

regularly. This results in the extensive reliance on experience in dealing with equipment

failures which may pose a significant safety threat in ATC Centres with a large

percentage of newly established and thus less experienced controllers. In general, the

controllers surveyed want to step away from over-reliance on experience and be

regularly trained as much as possible.

6.7.3.6.1 Other findings on training for recovery

In addition to the findings in the previous section, the questionnaire’s narrative section

highlighted interesting safety-relevant issues regarding recovery training. These are

individual comments rather than findings representative of the entire sample. The

reported issues focus on the quality and frequency of recovery training.


171

According to the controllers surveyed, the main problem is the overall lack of training

for supervisors, engineers, and controllers. The controllers believe that a couple of

hours of training per year is far too little practice and some of them feel that recurrent

training is necessary at least twice a year. In the event of more critical equipment

failures (e.g. radar) with high traffic levels, there may be occasions when there is no time

to act upon the recovery procedures. On these occasions, the role of training and

teamwork becomes much more important.

The controllers are aware that it is almost impossible to include everything that can go

wrong within the training syllabus, but emphasise that more training and guidance

should be given. They also highlight that training sessions should be as realistic as

possible in the simulated environment (e.g. higher traffic levels and the need to use

radar fallback system regularly). Currently, in some ATC Centres, the training only

focuses on outages (i.e. failure of the entire ATC system) and not on everyday failures.

In one ATC Centre, recurrent training takes place only on the night shift, so only those

controllers working that shift receive recovery training. This example highlights the

inconsistent provision of training within an ATC Centre.

6.7.3.7 Other findings on recovery performance

This section deals with additional findings extracted specifically from question 5. This

question aimed to provide an opportunity to controllers to discuss their past experience

with equipment failures which seriously impacted on their work. The findings extracted

from question 5 are presented in Appendix VI.

While section 6.7.3 has provided a high level analysis and results of the survey, the

following section presents a more rigorous analysis of the data.

6.7.4 Interaction analyses

The data analyses started with the assessment of the sample characteristics and

proceeded with the high-level summaries of controller responses. In this section, the

final set of data analyses investigates the relationships between the characteristics of

controllers (e.g. operational experience) and various recovery factors using appropriate

statistical tests. The section starts with a qualitative assessment of potential

interactions and identification of those relevant to controller recovery. This is followed

by the presentation of appropriate statistical tests and their key findings.


172

Several reciprocal interactions amongst controller characteristics and recovery factors

(corresponding to the key questions defined in section 6.1) are chosen for further statistical

testing and marked with the symbol ‘√’ (Table 6-4). This choice is based on relationships

known from operational experience, which are then tested using rigorous statistical

assessment. The focus is placed on controller recovery and the factors that influence it,

which corresponds to a total of eight interactions.

Table 6-4 Interaction matrix

Columns: (1) Operational experience; (2) Rating; (3) Experience with equipment failures; (4) Factors that influence recovery performance; (5) Formal exchange of information; (6) Existence of recovery procedures; (7) Existence of recovery training

Row | (1) | (2) | (3) | (4) | (5) | (6) | (7)
Operational experience (length of service) |  | √ | √ | √ |  |  |
Rating |  |  | √ | √ |  |  |
Experience with equipment failures (frequency per year) |  |  |  | √ |  |  |
Factors that influence recovery performance |  |  |  | √ | √ |  |
Formal (management supported) exchange of information |  |  |  |  |  |  |
Existence of recovery procedures |  |  |  |  |  |  |
Existence of recovery training |  |  |  |  |  |  |

The nature of the variables under consideration determined which statistical methods

could be used to analyse the data. As can be seen from their description in this

Chapter, three variables are categorical (rating, factors that influence recovery

performance, formal or management supported exchange of information on equipment

failures) whilst two represent a continuous or ratio scale variable5 (operational

experience-length of service, experience with equipment failures-frequency per year).

As the data differ significantly from the normal distribution, several non-parametric tests

at the 5 percent significance level (i.e. 95 percent confidence level) have been used. As previously explained in Chapter

4 (section 4.4.1), chi-square tests are used to test the relationships between two

categorical variables. Furthermore, the Cramer’s V test is used to measure the

5 As mentioned in Chapter 4, variables can be either continuous or categorical. Continuous

variables are numeric values on an interval or ratio scale (e.g. age, income). Categorical variables can be either nominal or ordinal. Nominal variables differentiate between categories but do not assume any ranking between them (e.g. gender). On the other hand, ordinal variables differentiate between categories that can be rank-ordered (e.g. from lowest to highest).



association for nominal data (i.e. interactions between ‘factors that influence recovery

performance’ with ‘rating’ and ‘existence of formal exchange of information on

equipment failures’) whilst the Kendall tau test is used for ordinal data (i.e. ‘factors that

influence recovery performance’). The relationship between two ratio variables is tested

via a non-parametric correlation, Kendall’s tau statistic, which uses the ranks of the

data to calculate the correlation coefficient. The correlation coefficient ranges between -1 and

1, where its sign indicates the direction of the relationship (either positive or negative)

whilst its absolute value indicates the strength of the relationship.
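The two association measures described above can be illustrated with a short sketch. It is not the software used for the analysis in this thesis: it assumes the tau-b variant of Kendall's tau (which adjusts for ties), assumes that no row or column of the contingency table sums to zero, and uses invented data throughout. The function names are hypothetical, and p-values are not computed.

```python
from itertools import combinations
from math import sqrt

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def cramers_v(table):
    """Cramer's V: chi-square rescaled to the range 0 (none) to 1 (perfect)."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi_square(table) / (n * k))

def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant and discordant pairs (O(n^2))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n0 = len(x) * (len(x) - 1) // 2  # total number of pairs
    denom = sqrt((n0 - tied_x) * (n0 - tied_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denom

# Invented 2x2 table: rows = has a given rating (yes/no),
# columns = relies mainly on 'other' factors (yes/no).
table = [[30, 20], [10, 40]]
v = cramers_v(table)

# Invented ratio data: years of service and failures experienced per year.
years = [2, 5, 8, 12, 20]
failures = [1, 3, 2, 5, 4]
tau = kendall_tau_b(years, failures)
print(round(v, 3), round(tau, 3))
```

In this invented example, eight of the ten pairs of controllers are concordant and two are discordant, which gives a positive coefficient of moderate strength.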

Finally, the relationship between a ratio variable and a categorical variable is tested using the non-

parametric Mann-Whitney test. The test is used to assess whether two samples of

observations come from the same distribution (Shier, 2004). The test involves the

calculation of a statistic, referred to as ‘U’ (see equation 6-1).

$$U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1, \qquad \text{(6-1)}$$

where n1 and n2 are the two sample sizes, and R1 is the sum of the ranks of all the

observations in sample 1. For samples larger than 20, the U statistic is assumed to be approximately

normally distributed, and is thus converted to a Z score using the formula in equation 6-2

(Shier, 2004):

$$Z = \frac{\text{largest } U \text{ value} - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} \qquad \text{(6-2)}$$
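Equations 6-1 and 6-2 can be traced through a small worked example. The sketch below is an illustration under stated assumptions, not the statistical package used for Table 6-5: tied observations receive the mean of their ranks, the standard deviation of U is taken as the square root of n1 n2 (n1 + n2 + 1) / 12 without a correction for ties, and the two samples are invented. The samples are also far smaller than the 20 observations needed for the normal approximation, so the Z value here only demonstrates the arithmetic.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney(sample1, sample2):
    """Return (U for sample 1, U for sample 2, Z) following equations 6-1 and 6-2."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])                    # R1: rank sum of sample 1
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1   # equation 6-1
    u2 = n1 * n2 - u1                       # U of the other sample
    u_largest = max(u1, u2)
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_largest - n1 * n2 / 2) / sd      # equation 6-2
    return u1, u2, z

# Invented years of service for controllers with and without a given rating.
with_rating = [12, 15, 20, 9]
without_rating = [3, 5, 9, 7]
u1, u2, z = mann_whitney(with_rating, without_rating)
print(u1, u2, round(z, 3))
```

Because the largest U value is used, Z is non-negative in this sketch; statistical packages that work with the smaller U value report the same magnitude with a negative sign.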

The results of all tests are presented in Table 6-5.

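Table 6-5 Statistical tests and results obtained

Variable 1 | Variable 2 | Test | Statistical significance at 95 percent confidence level
Operational experience (length of service) | Rating: ACC | Mann-Whitney non-parametric test | p>0.05
Operational experience (length of service) | Rating: APP | Mann-Whitney non-parametric test | p<0.001 (U=1382.5, z=-3.56)
Operational experience (length of service) | Rating: TWR | Mann-Whitney non-parametric test | p=0.014 (U=3387.5, z=-2.46)
Operational experience (length of service) | [Number of equipment failures experienced annually (Q4)] | Non-parametric test (Kendall’s tau) | p>0.05
Operational experience (length of service) | Written procedures | Mann-Whitney non-parametric test | p>0.05
Operational experience (length of service) | Situation-specific problem solving | Mann-Whitney non-parametric test | p>0.05
Operational experience (length of service) | Other | Mann-Whitney non-parametric test | p>0.05
Rating: ACC | Number of equipment failures experienced annually (Q4) | Mann-Whitney non-parametric test | p>0.05
Rating: APP | Number of equipment failures experienced annually (Q4) | Mann-Whitney non-parametric test | p>0.05
Rating: TWR | Number of equipment failures experienced annually (Q4) | Mann-Whitney non-parametric test | p>0.05
Rating: ACC | Factors that influence recovery performance | Non-parametric test (Cramer's V) | p=0.008 (see footnote 6)
Rating: APP | Factors that influence recovery performance | Non-parametric test (Cramer's V) | p>0.05
Rating: TWR | Factors that influence recovery performance | Non-parametric test (Cramer's V) | p>0.05
[Number of equipment failures experienced annually (Q4)] | Written procedures | Mann-Whitney non-parametric test | p>0.05
[Number of equipment failures experienced annually (Q4)] | Situation-specific problem solving | Mann-Whitney non-parametric test | p>0.05
[Number of equipment failures experienced annually (Q4)] | Other | Mann-Whitney non-parametric test | p>0.05
Written procedures | [Situation-specific problem solving] | Non-parametric test (Kendall’s tau) | p>0.05
[Written procedures] | Other | Non-parametric test (Kendall’s tau) | p>0.05
[Situation-specific problem solving] | Other | Non-parametric test (Kendall’s tau) | p<0.001
Written procedures | Formal exchange of information (Q7) | Non-parametric test (Cramer's V) | p>0.05
[Situation-specific problem solving] | Formal exchange of information (Q7) | Non-parametric test (Cramer's V) | p>0.05
Other | Formal exchange of information (Q7) | Non-parametric test (Cramer's V) | p=0.029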

Statistical tests performed indicated five significant relationships (Table 6-5). Significant

relationships are found between controllers with APP rating and TWR rating and years

of operational experience (i.e. years in service). In the sample surveyed, controllers

with APP rating have more operational experience compared to those without this

rating. Similarly, controllers with TWR rating have more operational experience

compared to those without it. Secondly, a significant relationship is identified between

other factors that influence recovery performance and ACC rating. The data indicate that

controllers with ACC rating tend to rely upon other factors (e.g. past experience) more

than those without ACC rating. This is expected as controllers with ACC rating in the

available sample have more operational experience than those without ACC rating.

Thirdly, a significant relationship is identified between controller reliance on situation-

specific problem solving and other factors (e.g. past experience) when recovering from

equipment failures. This is expected as past experience represents one of the factors

that define the context surrounding an equipment failure. Finally, a

significant relationship is identified between controller reliance on other factors (e.g.

past experience) when recovering from equipment failures and management supported

6 Relationship between other factors that influence recovery performance and ACC rating.



exchange of information regarding equipment failures (Table 6-5). It may be the case

that controllers regard the exchange of information regarding equipment failures as a

type of past experience.

On the other hand, no relationship is identified between the factors that influence

recovery process and operational experience (i.e. number of years active as a

controller). Although it was expected that less experienced controllers may rely more

on written procedures and that more experienced controllers may rely more on past

experience, statistical testing did not support these expectations. Years in service do

not differentiate between reliance upon a written procedure, context, or other factors

(e.g. past experience). It may be the case that the overall safety culture built in the ATC

Centre determines what a controller may use as the main resource in recovering from

equipment failures. For example, if procedures are not available, controllers will rely more on

situation-specific problem solving. This decision would therefore be based on

organisational issues more than on their own experience.

6.8 Summary

This Chapter has discussed in detail the questionnaire survey that sampled 134

controllers in 58 ATC Centres from 34 countries. The survey was designed to achieve

four main objectives. Firstly, to build on the literature review to further investigate

equipment failures and factors that influence controller recovery by introducing

operational experience. Secondly, to support the information obtained from operational

failure reports (as presented in Chapter 4), which lacked the input on controller

recovery. Thirdly, to assess the status and quality of recovery procedures and training

in the sampled set of ATC Centres. Finally, to contribute to the wider human reliability

research with a particular focus on controller recovery from equipment failures.

The results of the analyses conducted on the data consist of several interesting

findings. These are structured around six key questions that this survey addresses.

• How often do controllers experience equipment failures (Q1)?

Almost 95 percent of controllers surveyed experienced ATC equipment failure in their

operational career. The investigation of frequency of failures per year revealed that

major failures tend to occur only once a year or once in two years, while less severe

failures tend to occur with a relatively high frequency. These findings are in line with the

results obtained from operational failure reports and their categorisation based on

severity (presented in Chapter 4).



• What factors influence their recovery performance (Q2)?

Investigation of the factors that most influence controllers’ recovery performance

has revealed that factors other than written procedures and situation-specific problem

solving have the greatest impact, e.g. past experience. However, differences

between these ‘other’ factors (e.g. past experience) compared to written procedures

and situation-specific problem solving are not large, i.e. the controllers rated the

importance of all listed factors similarly.

• What is the most unreliable ATC equipment (Q3)?

Investigation of the most unreliable ATC equipment, based upon the experiences of the

controllers surveyed, has shown a match with the results obtained from the analyses of

operational failure reports (as presented in Chapter 4). The most affected ATC

functionalities are communication, surveillance, and data processing. The most

unreliable ATC equipment incorporates air-ground and ground-ground communication,

radar coverage, and the flight data processing system. These findings, together with

those from Chapter 4, led to the selection of the equipment failure to be simulated in

the experiment presented in Chapter 9 (i.e. the flight data processing system failure).

• Is there any organised exchange of information on equipment failures and/or other

types of unusual/emergency situations (Q4)?

The organised exchange of information on equipment failures represents an ‘indirect’

experience and a learning opportunity. Through presentations, seminars, and safety

bulletins, the controllers could be presented with failure types, contextual conditions

surrounding the failure, and the difficulties experienced by their fellow colleagues in

handling the situation. However, in the sample obtained almost half of the controllers

did not have this kind of information exchange organised in their ATC Centres.

• Do recovery procedures exist (Q5)?

Assessment of the existence and quality of recovery procedures shows that the

majority of sampled ATC Centres have some type of recovery procedure in place,

mostly for radar failure, communication failure, and flight data processing system

failure. The analyses also show that most of these procedures are kept up to date but

are not always complete. Therefore, additional emphasis should be placed on the revision

of existing procedures to assure that the recovery steps presented are complete and

that these follow a logical order. However, attention should be paid to the trade-off

between the thoroughness of the procedure and limited time available to perform all



prescribed steps and thus to recover. An example of a concise checklist-type recovery

procedure developed in this thesis for a specific European ATC Centre is presented in

Appendix III. It is based on a format used previously by the German air traffic service

provider (DFS), which was accepted and published by EUROCONTROL (2003f).

• What do controllers feel about the quality of training currently available for recovery

from equipment failures (Q6)?

Assessment of the existence and quality of training for recovery shows that only half of

the ATC Centres surveyed have established training for recovery from equipment

failures. The situation with recurrent training is even worse as only 36 percent of ATC

Centres surveyed organise regular recurrent training. In most cases, recurrent training

is provided only once a year, while in nine ATC Centres it is provided twice a year. On

the other hand, controllers support the idea of very frequent recurrent training. Almost

half of the respondents (i.e. 45 percent) feel an annual training session for a couple of

hours is simply not enough to keep them proficient and ready to deal with unexpected

equipment failures.

The process of identification of factors that affect controller recovery started in the

previous Chapter with an overall assessment of past research relevant to controller

recovery. It has continued in this Chapter by expanding these findings with the

questionnaire survey results and operational experience of controllers worldwide.

Based on these findings, the next Chapter finalises this rigorous process by identifying

factors that affect controller recovery, referred to as ‘Recovery Influencing Factors’

(RIFs).



7 Methodology for a Selection of Relevant Air Traffic Controller Recovery Influencing Factors

This Chapter builds on the findings from past research of relevance to controller

recovery (Chapter 5) further augmented by the operational experience extracted from

the questionnaire survey (Chapter 6) to realise a detailed understanding of the context

that surrounds a controller during the occurrence of an unexpected equipment failure.

The Chapter starts by illustrating the importance of the impact that contextual factors

have on controller recovery from equipment failures in Air Traffic Control (ATC). It

reviews both Air Traffic Management (ATM) and non-ATM related Human Reliability

Assessment (HRA) techniques to assure a comprehensive investigation of contextual

factors relevant to controller recovery from equipment failures in ATC. This initial

selection is augmented by the findings from the equipment reliability literature,

operational failure reports, human reliability research, and interviews with ATM

specialists. The Chapter concludes by identifying a set of relevant contextual factors,

referred to as ‘Recovery Influencing Factors’ (RIFs), and their qualitative descriptors or

the levels of their influence on controller recovery performance.

7.1 Relevance of the recovery context

Analyses of accident investigations in various industries (e.g. aviation, nuclear and

chemical) have revealed that it is not possible to gain a full understanding of the

cause(s) of an accident from factual data alone. For example, the US National

Transportation Safety Board (NTSB) conducted dozens of detailed accident

investigations in which teams of experts assessed different contributory

factors and identified issues relating to task design, procedures, culture

(mostly language barriers within pilot-controller communication), personal

factors (e.g. a shift in attention in the 1972 L-1011 accident in the Everglades; NTSB, 1973),

and weather (e.g. the Pan Am Flight 759 accident was due to a thunderstorm and wind shear;

NTSB, 1983). Such factors can help explain why errors occur. Additionally, the

description of the context may also serve as a basis for defining ways of preventing or



reducing specific types of erroneous actions by means of technical recovery (i.e. built-

in defences) and human recovery.

It is also necessary to take into consideration contextual factors that traditionally may

not be recorded by investigating bodies, but which can have a significant impact on the

outcome of an accident. In support of this, Dekker et al. (2004) note that it is

“necessary to capture both a situation in which the action takes place and the action

itself”. Similar arguments were presented by researchers at the National Aeronautics

and Space Administration (NASA) Ames Research Centre, who pointed out that "we

must move beyond trying to pin the blame for accidents on a culprit but seek instead to

understand the systemic causes underlying the outcomes" (cited in Cox, 2005). The

research presented in this thesis expands the analysis of equipment-related incidents

to include the context in which controller recovery unfolds. Therefore, the objective of

this Chapter is to determine the relevant contextual factors that affect the process of

controller recovery from equipment failures in ATC.

In Air Traffic Management (ATM), the contextual factors relevant to controllers are

defined as “internal or external factors which influence the controller’s performance of

ATM tasks” (EUROCONTROL, 2002b). It is notable that this definition is generic and

thus does not give an indication as to when it is appropriate to stop looking further for

contextual factors. The so-called ‘stopping rule’ is taken to be directly linked to the

overall investigation process, where assessment of contextual factors represents only

one segment of that process. In other words, it is the role of the investigator to

determine the chain of events that constitute a safety-relevant occurrence. In this

respect, the analysis of contextual factors should cover the entire chain and assess the

relevant context for each link in the chain. The research presented in this thesis adapts

the EUROCONTROL definition of contextual factors. Hence, the contextual factors in

this research or ‘Recovery Influencing Factors’ (RIFs) are defined as internal or

external factors that influence the controller’s recovery from unexpected equipment

failures in ATC.

The factors extracted from the various techniques are known in the HRA literature as

Contextual Conditions – CCs (EUROCONTROL, 2002b), Performance Shaping

Factors - PSFs (Shorrock, 1992; Shorrock and Kirwan, 2002; EUROCONTROL, 2004e;

THEMES, 2001; Swain and Guttman, 1983), Error Producing Conditions – EPCs

(EUROCONTROL, 2004d; Williams, 1986), Common Performance Modes – CPMs



(Hollnagel, 1993), Common Performance Conditions – CPCs (Hollnagel, 1998), or

Recovery Influencing Factors – RIFs (Kanse and van der Schaaf, 2000).

However, not all contextual factors are appropriate to describe the context around

recovery from equipment failures. This is because, firstly, many factors have been listed

and recognised as generic factors without a good understanding of their influence

specifically on the recovery process. Secondly, many of the existing contextual factors

are derived from the nuclear and process industries. Such factors are not always

transferable to the highly dynamic and time-dependent ATC environment. Thirdly,

some of the past research was based on the models of human performance not

representative of specific ATC tasks.

It should be noted that the research presented in this thesis does not rely exclusively

on any particular model of human information processing. Instead, it simply assesses

the importance of the recovery context and aims to derive a set of contextual factors

that best determines the controller recovery performance. The following section

presents two equipment failure incidents to highlight the importance of the context in

which controller recovery takes place.

7.1.1 Examples of the recovery context

Two real examples taken from an incident database of a Civil Aviation Authority (CAA)

are presented below to illustrate the relationship between failure, recovery, and

contextual factors. Because of their confidential nature, the examples are de-identified.

Although brief in the description of equipment failure, the two reports identified various

contextual factors and their impact on controller performance.

The first report contained the following: “At 2230 advice was received that there would

be a load test performed on the electrical system which would involve changing from

mains power supply to generators. Assurance was received that there would be no risk

of service interruption. Shortly after the power changeover two XX consoles crashed

followed by the remaining two. The Voice Switching Communication System (VSCS)

also failed as did the wall clock adjacent to the XX area. At the same time the simulator

also failed.” It was subsequently established that the root cause of the reported failure

had been within the ATC organisation which did not set up appropriate maintenance

procedures on the ‘live’ ATC system (i.e. organisational factor). Additionally, this report

highlighted the relevance of other contextual factors such as: the number of

workstations/sectors affected (i.e. loss of four workstations and the simulation platform),



time course of failure development (i.e. sudden failure), and complexity of failure type

(i.e. multiple failure: several workstations, clock, and simulation platform affected).

The second report contained the following: “The loss of radar display and VSCS at a

time of moderate traffic (approximately 10 aircraft on frequency) created substantial

workload on the controller. Thankfully, there were two controllers in the near vicinity

who were able to assist with a transition to a nearby controller working position and to

help maintain situational awareness and communications with the various aircraft via

air-ground (AG) bypass.” This report highlighted the impact of traffic complexity at the

moment of failure occurrence (i.e. ten aircraft in simultaneous communication with the

controller), personal factors (i.e. substantial workload), communication for recovery

within a team (i.e. assistance with handling the traffic and maintaining traffic awareness

in spite of the loss of all critical systems: visual representation of traffic on display and

direct communication with relevant aircraft), adequacy of organisation (i.e. availability

of additional support), number of workstations affected (i.e. one workstation), and

complexity of failure type (i.e. multiple systems affected: radar display and

communication system).
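The contextual factors identified in the two reports can be laid side by side to show which factors recur. The sketch below simply transcribes the factors named above into a dictionary; the layout and key names are illustrative assumptions and do not reflect the format of the CAA incident database.

```python
# Contextual factors identified in the two de-identified reports,
# transcribed from the text. The structure is a hypothetical
# illustration, not the format of the incident database.
reports = {
    "report 1": {
        "organisational factor": "no appropriate maintenance procedures on the 'live' ATC system",
        "number of workstations affected": "four workstations and the simulation platform",
        "time course of failure development": "sudden failure",
        "complexity of failure type": "multiple failure",
    },
    "report 2": {
        "traffic complexity": "ten aircraft in simultaneous communication",
        "personal factors": "substantial workload",
        "communication for recovery within a team": "assistance from nearby controllers",
        "adequacy of organisation": "availability of additional support",
        "number of workstations affected": "one workstation",
        "complexity of failure type": "multiple systems affected",
    },
}

first, second = set(reports["report 1"]), set(reports["report 2"])
shared = sorted(first & second)    # factors captured by both reports
only_one = sorted(first ^ second)  # factors captured by one report only
print(shared)
print(only_one)
```

Only two factors are captured by both reports, while six appear in just one of them. This shows how much the recorded context depends on what each reporter chose to describe.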

The two brief cases above taken from an incident database illustrate the important

relationship between failure, recovery, and relevant contextual factors. In other words,

these equipment failure examples have shown that the context in which human

performance takes place is important in understanding human reliability. Although the

examples do not convey the complete picture of the occurrence of equipment failure

(e.g. no mention of any personal issues in the first example, or of weather), several

contextual factors have been captured. As a result, research on controller recovery

from equipment failures in ATC requires a precise definition of the context surrounding

any failure type. In order to achieve this objective, it is necessary to review the specific

contextual factors defined in various HRA techniques. This is used together with

information from equipment reliability literature to identify the ‘Recovery Influencing

Factors’ (RIFs).

7.2 Methodology to extract the candidate set of contextual factors

In order to determine a candidate set of contextual factors relevant to controller

recovery from ATC equipment failures, it is necessary to start with a review of

contextual factors as identified in the most relevant current HRA techniques (i.e. ATM-

specific HRA techniques). It is important to highlight that this overview is not focused



on human error per se or the underlying human information processing theory. The

literature on human error has been used simply to investigate the relevant factors that

influence the human performance in unusual/unexpected events (i.e. contextual

factors). As a result, human information processing theories used in assessed HRA

techniques are outside the scope of this thesis.

It is also important to note that although there are currently three HRA techniques used

in the ATM sector, the review presented here has also considered other HRA

approaches employed in other domains to assure a complete set of RIFs. Furthermore,

a review of relevant equipment-failure characteristics and dynamic situational factors

has been conducted in order to augment the results from the review of the HRA

techniques. This is to ensure a complete and reliable determination of the RIFs. The

RIFs are then verified by interviews with ATM specialists. Figure 7-1 presents the

methodology used in this thesis to extract a candidate set of contextual factors relevant

to controller recovery from ATC equipment failures.

[Figure 7-1: flow diagram. Box labels: ATM related HRA techniques; Augmentation with findings from other HRA techniques; Augmentation with equipment-failure related characteristics; Augmentation with dynamic situational factors; Verification of selected RIFs by two ATM Specialists. Connector labels: Output; Identified gaps.]

Figure 7-1 Methodology to extract a candidate set of RIFs


7.2.1 Human reliability assessment techniques

The methodology for the selection of contextual factors relevant to controller recovery

starts with a review of contextual factors as identified in the most relevant current HRA

techniques.

7.2.1.1 Human Error in ATM (HERA)

The HERA project represents the most recent approach for the analysis of human error

in the ATM domain. It evolved because of European and US initiatives¹ to produce a

distinctive HRA tool. HERA is based on an extensive literature review and the

operational involvement of air traffic controllers, incident investigators, and safety

managers. The HERA project developed an initial set of Contextual Conditions (CCs) for ATM based on UK

incident reports, discussions with controllers, and the extensive literature on human factors

(EUROCONTROL, 2002b; EUROCONTROL, 2003d; EUROCONTROL, 2003e;

EUROCONTROL, 2004d). HERA uses eleven groups of CCs

to define context: pilot-controller communications, pilot actions, traffic & airspace,

weather, documentation & procedures, training & experience, workplace design & HMI,

environment, personal factors, team factors, and organisational factors. Each of the CC

groups is further sub-divided, resulting in more than 200 contextual factors. HERA

recommends that CCs should be applied individually to each error that occurred during

an incident, rather than just once for the entire incident. This supports the concept

presented in this thesis that analysis of contextual factors should cover the entire chain

of events leading to an incident. Thus it should assess contextual factors relevant for

each link in that chain (see section 7.1).

The majority of contextual factors defined in HERA are relevant to controller recovery

from equipment failures in ATC. Thus, the HERA technique represents a good starting

point for compiling a list of RIFs. For example, severe weather conditions can degrade

controller performance by adding workload to the already complex recovery

task. As such, weather should be incorporated in the list of RIFs.

There are also some factors defined in HERA that are not applicable to the recovery

from equipment failure in ATC. For example, pilot actions are relevant to ATM but not

ATC. Therefore, this particular factor will be excluded from the final choice of RIFs.

1 The US Federal Aviation Administration (FAA) developed the Human Factors Analysis and

Classification System (HFACS) tool.


Additionally, pilot-controller communication is not relevant in the immediate event of

equipment failure. Although not addressed in this thesis, there are circumstances when

pilot actions are of importance, such as in the case of a major failure or when

unplanned or erroneous pilot actions result in the increase of controller workload. More

important than the example above is the communication between a team of controllers

for efficient recovery. In this respect, communication (for recovery) and team factors

could be combined to create one factor since the entire team interaction takes place

through the communication for recovery. Only in the event of severe equipment failure

(i.e. a failure that adversely affects the availability of an Air Traffic Service (ATS) over a

significant period) is a controller obliged to inform all traffic (i.e. pilots) in the affected

airspace of a reduced level of ATS. Finally, there is a tendency to exclude

environmental issues when looking at more specific events, such as equipment failure,

on the basis that controllers are familiar with working in a specific ATC Centre. This is

discussed further in section 7.2.1.3.

7.2.1.2 Technique for the Retrospective and Predictive Analysis of Cognitive Errors in ATC (TRACEr)

This approach was developed by the UK National Air Traffic Services (NATS) to gain a

better understanding of controller error. It is a model-based approach, which performs

both a retrospective and a prospective analysis. The original version of TRACEr

contains eight different taxonomies, one of which describes context (Shorrock, 1992;

Shorrock and Kirwan, 2002). The CC groups derived in HERA were based largely on

the context defined in TRACEr. The TRACEr technique uses the Performance Shaping

Factors (PSF) taxonomy and “classifies factors that have influenced or could influence

controller performance, aggravating the occurrence of errors, or perhaps assisting error

recovery” (Shorrock and Kirwan, 2002). Thus, it can be concluded that TRACEr defines

context in a similar way to HERA, i.e. by defining relevant groups of PSFs. As with

HERA, each PSF group is further sub-divided, resulting in approximately 60 PSFs in

the TRACEr Light version. The PSF groups recognised by TRACEr are: traffic and

airspace (e.g. traffic complexity), pilot/controller communications (e.g. RT workload),

procedures (e.g. accuracy), training and experience (e.g. task familiarity), workplace

design, HMI and equipment factors (e.g. radar display), ambient environment (e.g.

noise), personal factors (e.g. alertness/fatigue), social and team factors (e.g.

handover/takeover), and organisational factors (e.g. conditions of work).


The main difference between TRACEr and HERA is that the former does not include

pilot actions and weather (see Appendix VII). Thus, no additional candidate factors

could be extracted from TRACEr.

7.2.1.3 Recovery from Automation Failure (RAFT) Tool

As previously discussed in Chapter 5, this tool has been developed as a part of the

“Solutions for the Human-Automation Partnerships in European ATM (SHAPE)” project,

managed by the Human Factors Division of EUROCONTROL. The SHAPE project

defines context as “any aspect of the operating environment that can influence a failure

or recovery process” (EUROCONTROL, 2004e). The project focused on the contextual

factors affecting recovery, which is in line with the objective of this thesis. The relevant

contextual factors or PSF categories recognised in RAFT are: task load and system

complexity, pilot-controller communication, procedures and documentation, training

and experience, human-machine interaction, personal factors, social and team factors,

logistical factors, and other organisational factors.

A review of the RAFT PSFs shows that ‘task load and system complexity’ represents the

workload facing the controller as a result of task performance and overall system

complexity. Therefore, this factor has the potential to be included as a RIF. Compared to

HERA, RAFT disregards ‘pilot action’, ‘weather’, and ‘environment’ as relevant

contextual factors for human recovery from equipment failure in ATC. Whilst pilot

actions do not have much impact as explained in section 7.2.1.1, weather can bring

additional complexity to the occurrence of equipment failure. At the same time, RAFT

includes a ‘new’ category called ‘logistical factors’, which includes maintenance and

staffing issues.

Environmental issues (e.g. noise, temperature, and lighting) are excluded. The reason

for this is that controllers become accustomed to the ambient characteristics of the specific

ATC Centre in which they work. On the other hand, logistical factors will be assigned to the existing

organisational factors category. The reason for this lies in the fact that staffing and

maintenance issues should be anticipated and pre-planned at organisational or

managerial level (e.g. maintenance scheduling, availability, and assignment of

personnel, stock of equipment and spare parts, on-the-job training aids). The

management in any ATC Centre should anticipate as far as possible unscheduled

technical disturbances and provide necessary defences for their prevention.


The three techniques (HERA, TRACEr, and SHAPE/RAFT tool) above were developed

specifically for the ATC/ATM environment. In general, they define context and

contextual factors in a way similar to that adopted in this thesis. The assessment of

these three techniques identifies a total of nine candidate RIFs. These are: communication,

traffic and airspace, weather, procedures, training and experience, HMI, personal,

organisational factors, and task complexity.
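
The consolidation described above can be illustrated with a short Python sketch. It is not part of the original analysis: the mapping dictionary and all identifiers are assumptions introduced here for illustration, and the assignment of each HERA or RAFT group to a candidate RIF reflects one reading of sections 7.2.1.1 to 7.2.1.3 (for example, that communication and team factors merge into a single ‘communication’ RIF).

```python
# Illustrative sketch only: names and mapping are assumptions based on
# one reading of sections 7.2.1.1-7.2.1.3, not a tool used in the thesis.

# Contextual-factor groups of the ATM-specific techniques, each mapped to
# a candidate RIF; None marks a group that the review excludes.
GROUP_TO_RIF = {
    # HERA Contextual Condition groups
    "pilot-controller communications": "communication",
    "team factors": "communication",   # merged with communication for recovery
    "pilot actions": None,             # relevant to ATM but not ATC
    "environment": None,               # controllers know their own Centre
    "traffic & airspace": "traffic and airspace",
    "weather": "weather",
    "documentation & procedures": "procedures",
    "training & experience": "training and experience",
    "workplace design & HMI": "HMI",
    "personal factors": "personal",
    "organisational factors": "organisational",
    # Additional RAFT categories
    "task load and system complexity": "task complexity",
    "logistical factors": "organisational",  # staffing and maintenance
}


def candidate_rifs(groups):
    """Return the sorted list of candidate RIFs covered by the given groups."""
    rifs = set()
    for group in groups:
        rif = GROUP_TO_RIF[group]  # a KeyError flags an unmapped group
        if rif is not None:
            rifs.add(rif)
    return sorted(rifs)


rifs = candidate_rifs(GROUP_TO_RIF)
print(len(rifs), rifs)
```

Under these assumptions the sketch yields the nine candidate RIFs listed above. TRACEr contributes no further groups, so it does not appear in the mapping.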

Whilst the review of ATM related HRA techniques gives many relevant contextual

factors, it is worth examining relevant non-ATM HRA techniques to investigate if other

factors exist. The following sections provide an insight into the relevant findings.

7.2.1.4 Recovery from failures: understanding the positive role of human operators during incidents

This research attempted to emphasise the positive role of human operators in the

overall system performance. In addition, it proposed a preliminary failure compensation

process model (or recovery model) derived initially for the chemical process industry.

Furthermore, the importance of a taxonomy used to describe the factors influencing

recovery was recognised. Based on the experience gained from field studies and the

relevant literature, Kanse and van der Schaaf (2000) developed a list of RIFs. In their

research the recovery factors were defined as factors that contribute to human

recovery performance once an error or failure has occurred. This definition

corresponds to the definition of RIFs adopted in this thesis. A categorisation into six

groups of RIFs adopted by Kanse and van der Schaaf (2000) from the power plant

industry is presented in Table 7-1.

Table 7-1 Factors influencing recovery from failures (from Kanse and van der Schaaf, 2000)

Categories of factors: Recovery Influencing Factors

Prioritisation of recovery-related tasks: Time available for recovery task, considering other tasks requiring attention; Urgency of recovery (amount of time until negative consequences arise); Importance of or need for recovery (seriousness of possible consequences if not recovered)

Occurrence-related: Type(s) of preceding failures; Performance phase in which the immediate result of the failure process is detected (during the planning phase/while carrying out the action/when the outcome of the action is observable); Available and applicable barriers/defences

Human (person) related: Overall work area knowledge; Work area and process related skills; General competency in job; Time elapsed since last (re)training in work area; Time since last (re)training with regard to specific problem occurrence; Suspicion/distrust/intuition; Personal attitude toward failure and failure compensation; System failure coping strategies; Self-efficacy (trust in own ability), self esteem; Fatigue; Shift work coping ability; Feeling of personal responsibility for the failure or problem; Feeling of personal responsibility with regard to recovery; Pride regarding job well done; Previous experience with failures (any type); Previous experience with this failure (any type)

Social: Team attitude toward failures and failure compensation; Attitude toward teamwork; Team efficacy; Feeling of team responsibility for the failure or problem; Feeling of team responsibility with regard to recovery

Organisational: Availability of team members/colleagues; Organisation of work and responsibilities; Training plan; Competency assessment plan; Supervision; Personnel selection processes; Availability, quality and usability of procedures/instructions; Shift patterns and personnel planning; Organisational policy; Management attitudes towards failures & failure compensation

Technical/workplace/situational: Availability of equipment/materials needed; Operator-process interface properties

The majority of the identified factors are relevant to equipment failures in ATC and

should be considered as potential RIFs. For example, ‘available and applicable

barriers/defences’ are important with respect to detection, diagnosis, and correction of

equipment failure. Time pressure is recognised under the ‘prioritisation of recovery-

related tasks’. Equipment failures in ATC are unexpected events, which degrade the

ATC service offered. In this case controllers are still required to provide a service to

ensure a safe flow of traffic. As a result, controller workload increases rapidly,

potentially compromising controller performance. Therefore, time pressure should be

analysed for potential inclusion in the RIFs. Occurrence-related factors are mostly

applicable to the power plant environment and as such cannot be applied directly to

ATC. However, if adapted to the characteristics of the ATC environment, these

factors may be relevant to equipment failure occurrence.

7.2.1.5 Computerised Operator Reliability and Error Database (CORE-DATA)

The CORE-DATA database was developed at the University of Birmingham to assist

the UK personnel involved in the assessment of hazardous systems such as nuclear,

chemical, and offshore systems (Kirwan, Basra, and Taylor-Adam, 1997;

EUROCONTROL, 2002b; EUROCONTROL, 2004d). It represents an attempt to

develop a systematic approach to recording human errors. Several sources of data are

used to populate the database including: real operating experience (incident and

accident reports), simulation (both training and experimental simulators), experiments

(from literature on performance), expert judgment (e.g. as used in risk assessments),


and synthetic data (from human reliability quantification techniques). According to

EUROCONTROL (2002b), CORE-DATA contains approximately four hundred data

records describing particular errors that have occurred, together with their causes, error

mechanisms, and their probabilities of occurrence. PSFs are defined in CORE-DATA

as underlying causes which influence human performance and indicate how the human

error occurred. CORE-DATA’s PSF taxonomy consists of alarms, communication,

ergonomic design, ambiguous HMI, HMI feedback, labels, lack of supervision/checks,

procedures, refresher training, stress, task complexity, task criticality, task novelty, time

pressure, training, and workload.

There are a number of factors here of potential relevance to ATC and controller

recovery. Firstly, alarms should be considered as a particular type of technical built-in

defence (discussed in Chapter 4) and are therefore important with respect to detection,

diagnosis, and correction of equipment failure. This is also in accordance with the work

done by Kanse and van der Schaaf (2000) as explained in the previous section. Hence

‘alarm’ should be considered as a potential RIF. Secondly, task novelty or task

familiarity in the case of equipment failures in ATC should be considered under the

training and experience RIF. Thirdly, time pressure has also been recognised in the

work done by Kanse and van der Schaaf (2000) under the ‘prioritisation of recovery-

related tasks’. Therefore, this factor should be analysed for inclusion in the RIFs.

7.2.1.6 Technique for Human Error Rate Prediction (THERP)

The THERP technique was developed by Alan Swain at Sandia National Laboratories

in the 1950s (Swain and Guttman, 1983; Straeter, 2000). The THERP technique

assumes that human information processing can be influenced by error conditions

(Performance Shaping Factors, PSFs). THERP subdivides all PSFs into internal,

external, and those that act as physiological and psychological stressors. However, the

ways in which PSFs act on human performance are not explicitly specified.

Furthermore, THERP sub-divides external PSFs into situational factors, task factors,

and task instructions. Internal factors are defined as factors related to the organism (i.e.

human factors). The PSFs recognised in THERP are presented in Table 7-2.

Table 7-2 Factors influencing human actions in THERP (cited in Straeter, 2000)

Category: Factors influencing human actions

External Performance Shaping Factors

Situational factors: Design features; Quality of environment; Temperature, air humidity, air quality, radiation exposure, illumination, noise, vibration, cleanliness; Working hours; Breaks; Availability of special work resources; Job manning; Organisational structure (authority, responsibility, channels of communication); Actions by shift leader, worker, manager, supervisory authority; Remuneration structure (recognition, payment)

Factors in tasks and work resources: Requirements for perception; Requirements for motor system (speed, power expenditure, accuracy); Relationship between operators and display; Requirements for adaptation; Interpretation; Decision making; Complexity (information loading); Narrow nature of task; Short term and long term memory; Calculations; Feedback (knowledge regarding results of an action); Dynamic of gradual actions; Group structure and communications; Man-machine factors; Interface (design of work resources, test instruments, maintenance equipment, work aids, tools, accessories)

Work and task instructions: Required procedures (written, non-written); Written and verbal communication; Warnings and danger signs; Work-methods; Plant policy

Stressors

Psychological stressors: Suddenness of occurrence; Duration of stress; Task speed; Task load; High hazard risks; Threats (fear of failure, loss of job); Monotony, degrading or meaningless activities; Duration of uneventful periods of alertness; Work performance motive conflicts; Reinforcement of missing or negative sensory deprivation; Detractors (noise, blinding, motion, flickering, coloration); Inconsistent labelling

Physiological stressors: Duration of stress; Fatigue; Pain or discomfort; Hunger or thirst; Extreme temperatures; Radiation; Extreme gravitational forces; Extreme pressure conditions; Inadequate oxygen supply; Vibration; Restricted movements; Absence of physical exercise; Interruption of circadian rhythm

Internal Performance Shaping Factors

Factors relating to the organism (i.e. human factors): Prior training, experience; State of momentary practice or abilities; Personality and intelligence variables; Motivation and attitudes; Emotional states; Stress (mental or physical); Knowledge about demanded performance prerequisites; Gender differences; Physical conditions; Attitudes deriving from family or groups; Group dynamic processes

A review of the contextual factors relevant to THERP reveals that most can be

allocated to the RIFs identified by the first three ATM-related techniques. Several other

factors, such as decision-making, short-term memory, and long-term memory (external PSFs),

may be categorised as personal factors. These factors may become increasingly

important within the planned modernisation of ATM (i.e. datalink, electronic strips, or

‘stripless’ environment). Finally, the ‘suddenness of occurrence’ factor identified in

THERP cannot be categorised within the existing RIF groups. This factor is relevant

to equipment failure in the ATC environment as it greatly affects the controller’s

detection of the failure. Hence it should be treated as an additional potential RIF.

7.2.1.7 Human Error Assessment and Reduction Technique (HEART)

The HEART technique was developed by Jeremy Williams, a British ergonomist, in

1985. The review of this technique is available in EUROCONTROL (2004d) and


Williams (1986). It is one of the most popular human error quantification techniques

due to its ease of implementation and is still used extensively in the nuclear, chemical,

petrochemical, railway, and defence industries.

HEART was derived from a wide range of findings in ergonomics literature. The

technique defines a set of generic error probabilities for the tasks considered, and

identifies the Error Producing Conditions (EPCs) associated with these. EPCs include

particular ergonomic, task (e.g. inactivity, repetitious, or low mental workload tasks,

additional team members necessary to perform task normally), and environmental

factors that could each have a negative effect on human performance. In other words,

the definition of contextual factors or EPCs emphasises purely their negative impact on

human performance. The extent to which each EPC factor affects performance is

quantified and the human error probability is calculated as a function of the precise

effect of each EPC on a particular task. HEART assumes that basic human reliability is

dependent upon the generic nature of the task to be performed and that under nominal

conditions this level of reliability will tend to be consistent (Williams, 1986).
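
The calculation outlined above can be sketched as follows. In the commonly cited form of HEART, each EPC has a maximum multiplier, and the assessor judges what proportion of that maximum effect applies to the task; the nominal error probability of the generic task is then multiplied by ((multiplier - 1) x proportion + 1) for every applicable EPC. The function name and the numerical values below are illustrative assumptions, not figures taken from Williams (1986) or from this thesis.

```python
# Illustrative sketch of the HEART calculation; all numbers are placeholders.

def heart_hep(nominal_hep, epcs):
    """Estimate a human error probability (HEP) in the HEART manner.

    nominal_hep: error probability of the generic task under nominal conditions.
    epcs: iterable of (max_multiplier, assessed_proportion) pairs, where the
          proportion (0 to 1) is the assessor's judgement of how much of the
          EPC's maximum effect applies to the task.
    """
    hep = nominal_hep
    for max_multiplier, proportion in epcs:
        if not 0.0 <= proportion <= 1.0:
            raise ValueError("assessed proportion must lie between 0 and 1")
        # Assessed effect of one EPC: 1 (no effect) up to max_multiplier.
        hep *= (max_multiplier - 1.0) * proportion + 1.0
    return min(hep, 1.0)  # a probability cannot exceed 1


# Hypothetical task with two EPCs, e.g. time shortage and information overload.
example = heart_hep(0.003, [(11, 0.4), (6, 0.5)])
print(example)
```

Because every assessed effect is at least 1, the estimate can only rise above the nominal value, which mirrors the purely negative view of context noted above.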

HEART identifies 38 different EPCs. These can be

categorised into two groups, those directly transferable to ATC and those that are not.

The EPCs relevant to ATC can be further sub-divided into those that fit within existing

RIF categories and those that do not. The former are, for example, ‘unfamiliarity with a

situation which is potentially important but which only occurs infrequently or which is

new’, ‘a shortage of time available for error detection and correction’, and ‘a channel

capacity overload’. The EPC concerned with ‘unfamiliarity with a situation’ may be

captured through two RIFs i.e. training and experience. Unusual or emergency

situations (such as ATC equipment failures) are rare but highly demanding events that

require efficient and effective response from each controller. Regular and

comprehensive training plays a key role in building the skills and experience

necessary to cope with such unusual situations. ‘Shortage of time available’ has

already been discussed and recommended to be included as a candidate RIF (see

section 7.2.1.5). Finally, ‘channel capacity overload’ is a term used for the workload

caused by simultaneous presentation of critical information to the human operator. As

such it can be classified under personal factors.

The EPCs not relevant to ATC include several factors. For example, a category

‘mismatch between the educational level and the requirements of the task’ is not

applicable to controllers. The level of education and training for an ATC licence is


standardised and reflects the knowledge controllers should acquire. Furthermore, the

category ‘an incentive to use more dangerous procedures’ is also not applicable to

ATC as ‘dangerous’ procedures or working practices are direct violations of the rules.

7.2.1.8 The Contextual Control Model (COCOM)

The COCOM model, developed by Hollnagel (1993), describes how human

performance is dynamically determined by the current context, as an alternative to the

common information processing models. This is a generic HRA approach not related to

any specific industry.

COCOM represents a control model of cognition focusing on two important aspects:

the conditions under which a person changes from one mode to another and the

characteristics of human performance in a given mode. COCOM recognises four

control modes: scrambled, opportunistic, tactical, and strategic. According to this

approach human actions are determined by the context as well as specific

characteristics and mechanisms of human cognition. In Hollnagel’s view, humans do

not passively react to events, they actively look for information and act based on

intentions as well as external developments. Therefore, it was concluded that human

actions are only meaningful when considered in the appropriate context.

In this regard, COCOM defines Common Performance Modes (CPMs) as the conditions

under which human performance takes place. Hollnagel (1993) divides them into

CPMs that may increase or decrease human reliability. The former include sufficient

available time, available plans, adequate Man Machine Interface (MMI) and support,

few simultaneous goals, normal/familiar process state, and adequate organisation. The

CPMs that may reduce reliability include insufficient available time, plans not available,

inadequate MMI and support, many simultaneous goals, abnormal process state, and

inadequate organisation.

According to Hollnagel (1993), the objective is not to find a precise probability of a

specific action but rather to identify the specific steps that are particularly prone to

produce hazardous consequences. This knowledge can then be used to change the

design of the system, to introduce specific measures of compensation, and to construct

defences and recovery options. Generally, the objective of the recovery performance

assessment should be to identify the context that is likely to result in an inadequate

recovery performance. The characteristics of the context resulting in an inadequate

recovery performance would be used to define the necessary changes to the ATC


system/component design (e.g. technical defences, recovery procedures and training).

This should allow the whole ATC system to be safer and more reliable.

The COCOM technique was subsequently used in the development of another method

discussed in the next section. Therefore, the final choice of potential RIFs from

both techniques is discussed within the next section.

7.2.1.9 Cognitive Reliability and Error Analysis Method (CREAM)

The CREAM methodology represents a further development of the COCOM model that

deals with the duality of competence and control in human cognition (Hollnagel, 1998).

Basing the work on COCOM’s model of cognition and four distinctive control modes,

CREAM represents a practical approach for both human performance analysis (i.e.

retrospective analysis) and performance prediction. The method is cyclical rather than

sequential and has well-defined conditions that identify when an analysis should end.

Similar to COCOM, CREAM represents a generic approach not related to any specific

industry.

Using past research (i.e. THERP technique), Hollnagel (1998) attempts a more

structured approach where related categories of contextual factors are grouped

together. As a result he defines a small set of Common Performance Conditions (CPCs)

that contain the general determinants of performance (i.e. common modes) including:

adequacy of organisation, working conditions, adequacy of MMI and operational

support, availability of procedures/plans, number of simultaneous goals, available time,

time of day (circadian rhythm), adequacy of training and experience, and crew

collaboration quality. The proposed CPCs were intended to have a minimal degree of

overlap, although they are not independent.

Hollnagel (1998) argues that there is a significant similarity between PSFs and CPCs.

However, the difference lies in the scope of these factors. Similar to CPMs in the

previous COCOM technique, CPCs are more generic conditions that are

designed to be applied in the early stage of the analysis to characterise the context for

the entire human operational task. On the other hand, PSFs tend to be more specific

and focused on a particular stage of that task.

Hollnagel (1998) went one step further to define the levels that each CPC can take and

their corresponding effects on performance reliability (the so-called ‘typical values’ of

CPCs). These levels are based on general human factors knowledge and experience


from the HRA discipline. Hollnagel used the general principle that advantageous

performance conditions improve reliability, whereas disadvantageous conditions are

likely to reduce it. If reliability is improved, operators are expected to fail less often in

their tasks and perform better in general. He proposed an expected effect of each CPC

on performance reliability at three levels: improved, not significant, or reduced. The

advantages of this approach can be seen in the direct link between the descriptors

used for CPCs and expected effect on human performance reliability. As such, the

research presented in this thesis adopted this approach (further explained in section

7.3).

In order to determine the overall effect of the context on human performance, the

CREAM technique relies on expert judgement of the relevance of each CPC for the

particular event under investigation and its impact on the probability of failure (no

impact, improves, reduces). The resulting score is used to determine the expected

control mode, which, as previously mentioned, is: scrambled, opportunistic, tactical, or

strategic control.
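
The first step of this judgement, tallying how many CPCs are expected to reduce or improve performance reliability, can be sketched as follows. The function and variable names are assumptions made for illustration. The subsequent mapping of the two counts to a control mode follows a diagram published by Hollnagel (1998), which is not reproduced here, so the sketch stops at the counts.

```python
# Illustrative sketch of the CPC tally; identifiers are assumptions.
from collections import Counter

# The nine Common Performance Conditions listed in this section.
CPCS = (
    "adequacy of organisation",
    "working conditions",
    "adequacy of MMI and operational support",
    "availability of procedures/plans",
    "number of simultaneous goals",
    "available time",
    "time of day (circadian rhythm)",
    "adequacy of training and experience",
    "crew collaboration quality",
)
EFFECTS = ("improved", "not significant", "reduced")


def tally_cpc_effects(judgements):
    """Count CPCs judged to reduce and to improve performance reliability.

    judgements: dict mapping a CPC name to one of EFFECTS. CPCs that are
    not mentioned are treated as 'not significant'.
    Returns the pair (number reduced, number improved).
    """
    unknown = set(judgements) - set(CPCS)
    if unknown:
        raise ValueError(f"unknown CPCs: {sorted(unknown)}")
    bad = {e for e in judgements.values() if e not in EFFECTS}
    if bad:
        raise ValueError(f"unknown effect levels: {sorted(bad)}")
    counts = Counter(judgements.get(cpc, "not significant") for cpc in CPCS)
    return counts["reduced"], counts["improved"]


# Hypothetical expert judgement for one equipment-failure event.
example = tally_cpc_effects({
    "available time": "reduced",
    "number of simultaneous goals": "reduced",
    "adequacy of training and experience": "improved",
})
print(example)
```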

Taking account of the review of both the CPMs (COCOM) and CPCs (CREAM), the

majority of the factors identified are directly transferable to ATC. The exceptions are

the number of simultaneous goals and normal/familiar process state (see Appendix VII).

Regarding the number of simultaneous goals, it is important to highlight that air traffic

control implies the simultaneous processing of multiple tasks. In other words, a

controller may be in radio contact with 10-20 aircraft while simultaneously performing

computer-related tasks (e.g. entering assigned altitude information, handing off flights

to another controller). Therefore, a high level of multitasking is an inherent

characteristic of ATC (Wickens, 1992) and as such this factor will be excluded from the list of

RIFs. The other factor (normal/familiar process state) is highly relevant to recovery

performance but has to be mapped indirectly to training and experience.

7.2.1.10 Human Reliability Management System (HRMS)

The HRMS technique was developed to derive a comprehensive and accurate

assessment of human contribution to risk in the nuclear industry, through a detailed

task and error analysis, quantification, and practical error reduction scheme. Since this

technique was too resource-intensive, it was necessary to additionally develop a fast

screening technique. This ‘light’ version required a detailed approach only for those

scenarios that showed critical human involvement. This led to a subsequent

technique, the Justification of Human Error Data Information (JHEDI). Six PSFs were


identified based on the assessment of several HRA techniques (Kirwan, 1997): time,

quality of information and interface, training/expertise/experience/competence,

procedures, task organisation, and task complexity. Context is defined as the complete

task design, the working and organisational environment, and the entire history of the

task and of the individual(s) performing it. In fact, context encompasses all the

conventionally used PSFs plus a myriad of other factors, including culture, many of which are too

microscopic and idiosyncratic, or possibly too macroscopic and intangible, to allow

a tractable predictive analysis (Kirwan, 1997).

The HRMS approach is based on its own audit document, which consists of fifty questions,

an assumed limit for an acceptable and practicable tool. The expert inputs to each

of these questions (‘yes’, ‘no’, ‘not applicable’) are used to rate each PSF on a scale from

zero to ten, where a value of zero represents a near-perfect design and ten a poor

design. As a result, a profile of PSFs is created for each task and linked to the

known value of human error probability for that task (extracted from the available

incident database). The quantitative assessment of each new task consists of

comparing it with known tasks (and their PSF profiles) and deriving an extrapolation rule

to predict its outcome.
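
The comparison of a new task with known tasks can be illustrated with a deliberately simple sketch. The description above does not specify the extrapolation rule, so the code below substitutes a nearest-profile lookup (Euclidean distance over the six PSF ratings) purely as a stand-in. The task names, ratings, and probabilities are invented placeholders, not data from HRMS or JHEDI.

```python
# Illustrative stand-in for the profile comparison; all data are invented.
import math

# The six PSFs named in this section, each rated 0 (near-perfect) to 10 (poor).
PSFS = (
    "time",
    "quality of information and interface",
    "training/expertise/experience/competence",
    "procedures",
    "task organisation",
    "task complexity",
)

# Hypothetical known tasks: PSF profile and associated human error probability.
KNOWN_TASKS = {
    "known task A": ((2, 3, 1, 2, 4, 3), 1e-3),
    "known task B": ((8, 7, 6, 9, 7, 8), 1e-1),
}


def closest_known_task(profile, known_tasks=KNOWN_TASKS):
    """Return (name, hep) of the known task whose PSF profile is nearest.

    This nearest-profile rule is an assumption standing in for the
    extrapolation rule, which the text does not define.
    """
    if len(profile) != len(PSFS):
        raise ValueError("profile must contain one rating per PSF")
    if any(not 0 <= rating <= 10 for rating in profile):
        raise ValueError("ratings must lie between 0 and 10")
    name = min(known_tasks, key=lambda n: math.dist(profile, known_tasks[n][0]))
    return name, known_tasks[name][1]


# A new task whose profile resembles the first known task.
example = closest_known_task((3, 3, 2, 2, 3, 3))
print(example)
```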

Looking at the PSFs identified in HRMS and JHEDI above, it is clear that ‘time’ is an

important factor also relevant to controller recovery. The time it takes to recover from

the occurrence of an equipment failure is important in ATC due to its highly dynamic

nature and the potential for development of an unsafe situation (e.g. loss of standard

separation distance between aircraft). The other factors (e.g. quality of interface,

training, procedures) are also relevant to ATC and are already discussed for their

inclusion as potential RIFs.

7.2.1.11 A Technique for Human Event Analysis (ATHEANA)

The US Nuclear Regulatory Commission supported the development of ATHEANA as

a technique to overcome the shortcomings of the first generation HRA techniques

(Nuclear Regulatory Commission, 1998). ATHEANA is a context-driven technique for

the identification and analysis of human failure events. This technique was intended to

provide a means for analysing Errors Of Commission (EOC). ATHEANA moved away

from random human errors under nominal conditions to errors which result from

error-forcing contexts. According to ATHEANA, an error-forcing context comprises two

components (i.e. plant conditions and associated PSFs) and is associated with (human)

unsafe actions. Thus, the emphasis is placed on the negative impact of context on



human performance (similar to the HEART technique). ATHEANA borrows its methodology

from HEART (see section 7.2.1.7) but incorporates various plant conditions into the

analysis. Starting from the basic scenario (i.e. nominal plant mode), various alternative

deviation scenarios were developed. The deviation scenarios include additional events

that increase the likelihood of certain error-mechanisms to be triggered (Nuclear

Regulatory Commission, 1998).

As in most other HRA methods, the PSFs derived for ATHEANA are broad categories

which need to be assessed for adequacy by the HRA analyst. These are: procedures,

training, communications, supervision, staffing, human-system interface, organisational

factors, stress, and environmental conditions. All these factors are relevant to controller

recovery from equipment failures in ATC and have already been discussed in the

previous sections.

7.2.1.12 Connectionism Assessment of Human Reliability (CAHR)

The CAHR technique was developed as part of a PhD dissertation and a project for the

German nuclear industry (Straeter, 2000). The objective of this dissertation was to

develop a method for evaluation of human reliability within plant events. The novelty in

this approach is that it is based on very detailed databases introduced to facilitate

international exchange of experiences on events in the nuclear industry. These

databases are: the Nuclear Computerized Library for Assessing Reactor Reliability

(NUCLARR), the Incident Reporting System (IRS), and the German special

occurrences database (BEVOR). These databases collect mandatory occurrences data

to enable international exchange of experiences on events in nuclear systems (Straeter,

2000).

The CAHR technique is based on the evaluation of the operator’s task from the incident

description and identification of interactions between various PSFs. In general, PSFs

are defined here as causes or conditions necessary for the occurrence of an error.

Straeter (2000) considered a weighting scheme for each PSF. Since the available data

sources (i.e. databases) offered a high-level event description, it was possible to move

away from a judgment based categorisation of PSFs towards a more analytical method.

Straeter (2000) determined the frequencies with which a shaping factor was observed

in connection to a human error of a certain type. However, as much as this approach

seems reasonable, it requires access to highly detailed datasets of human reliability

performance. Amongst the investigated events, Straeter (2000) determined 30

conditions under which human errors occurred. These were categorised into six groups:



• task (e.g. preparation, simplicity/complexity, precision, time pressure);

• order issue (clarity of procedures, design of procedure, content, completeness,

presence);

• person (e.g. processing, information, goal reduction);

• activity (e.g. usability of control, usability of equipment, monotony, positioning,

quality assurance, equivocation of equipment);

• feedback (e.g. arrangement of equipment, display range, accuracy of display,

labelling, marking, reliability); and

• system (e.g. technical layout, external event, construction, redundancy, coupled

equipment).
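
The frequency-based analysis described above, in which a shaping factor is counted each time it is observed in connection with a given type of human error, can be sketched as follows. The incident records, error types, and function name are hypothetical and serve only to illustrate the counting step; they are not taken from the CAHR databases.

```python
from collections import Counter

# Hypothetical incident records: (error type, PSFs observed in the event).
# Invented for illustration only.
INCIDENTS = [
    ("omission", {"time pressure", "clarity of procedures"}),
    ("omission", {"time pressure"}),
    ("commission", {"labelling", "time pressure"}),
    ("commission", {"labelling"}),
    ("omission", {"clarity of procedures"}),
]


def psf_frequencies(incidents):
    """Relative frequency with which each PSF accompanies each error type."""
    # Number of events per error type.
    totals = Counter(error for error, _ in incidents)
    # Number of events per (error type, PSF) pair.
    pairs = Counter((error, psf) for error, psfs in incidents for psf in psfs)
    return {pair: count / totals[pair[0]] for pair, count in pairs.items()}


freqs = psf_frequencies(INCIDENTS)
for (error, psf), value in sorted(freqs.items()):
    print(f"{error:10s} {psf:22s} {value:.2f}")
```

As the text notes, counts of this kind are only meaningful when the underlying event descriptions are detailed enough to identify the shaping factors reliably.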

The identified PSFs are applicable to recovery from equipment failures in ATC and

have already been considered for inclusion as candidate RIFs (e.g. task, order issue

- procedures, person, activity – operational support, feedback - HMI). The last CAHR

category (i.e. system) is also relevant as a potential RIF, especially as it deals with

technical layout or system architecture and level of redundancy (as a type of built-in

technical defence). However, these factors are important from a technical point of view

since they directly determine the reliability and availability of the ATC service. The

research presented in this thesis focuses on controller recovery performance once all

redundant systems fail and affect the controller’s ability to control traffic in dedicated

airspace. As a result, more emphasis should be placed on built-in defences

transmitting information to the controller regarding the failure (e.g. alarms, alerts) since

these have an effect on the quality of the controller recovery process (for details see

Chapter 4, section 4.3.2). This also directly corresponds to findings by Kanse and van

der Schaaf (2000) reviewed in section 7.2.1.4.

7.2.1.13 Nuclear Action Reliability Assessment (NARA)

The Nuclear Industry Management Committee (IMC) and British Energy supported an

initiative to produce an enhanced and updated version of the HEART technique

specific to the nuclear industry and known as Nuclear Action Reliability Assessment -

NARA (Kirwan et al., 1994). A review of the data sources used for the original version

of HEART pointed out the need for a detailed human error probability database

(CORE-DATA) which overcame some of the shortcomings detected in the intervening

years. NARA is based on a combination of CORE-DATA and real accident/incident

data available from the nuclear industry, augmented by expert judgement.



In this technique, contextual factors are referred to as Error Producing Conditions

(EPCs). However, the set of EPCs included in NARA was based simply on a review of

the data sources used in the original version of HEART. From the original thirty-eight

PSFs identified in HEART, eighteen were included in NARA based on the findings from

the research by Kennedy et al. (2000). The factors relevant to controller recovery are

the same as those in the HEART model.

7.2.1.14 Human Performance DataBase (HPDB)

Park et al. (2004) emphasised the need to collect plant-specific or domain-specific data

in order to identify the key factors that can degrade/enhance a plant’s safety. To fulfil

this requirement they initiated the Human Performance DataBase (HPDB) under the

Korean Atomic Energy Research Institute. The objective of this database was to

provide the reliable human performance information needed to perform HRA,

especially for plant-specific emergencies. In order to achieve this objective, they

collected operational emergency reports from regular training sessions. Information

that was considered relevant for an appropriate HRA analysis was grouped under the

following categories:

• available procedure;

• description of the different tasks, steps, and actions, and their dependence;

• demand of perception, cognition, and action to perform necessary tasks and

actions;

• person or team issues;

• level of experience; and

• time needed to correctly perform tasks, steps, and actions.

The third category ‘demand of perception, cognition, and action to perform necessary

tasks and actions’ refers to the operator’s workload. This factor has been assumed

under the personal factors similar to the approach taken in section 7.2.1.5. All other

factors have already been assessed as relevant to the recovery from equipment

failures in ATC.

Similar to the main objective of HPDB, the research presented in this thesis is relevant

to the advancement of knowledge of controller performance under emergency/unusual

situations, such as equipment failure in ATC. When an equipment failure occurs,

controller behaviour tends to differ from normal everyday routine behaviour. For this

reason, it is necessary to review relevant internal or external factors that influence the

controller’s recovery from unexpected equipment failures in ATC.



The discussions presented in the previous sections attempted to extract relevant

factors from various strands of human reliability research to ensure a complete presentation of

the recovery context under research in this thesis. The following section gives a

summary of the findings.

7.2.1.15 Summary of the findings

The Recovery Influencing Factors (RIFs) relevant to ATC equipment failure have been

selected on the basis of several sources of information. In general, the definitions of

contextual factors throughout the assessed HRA techniques show great similarity,

where contextual factors are seen as causes, conditions, or factors that influence

human performance. The only difference is observed in three techniques (HEART,

ATHEANA, and CAHR) which focus purely on negative human performance.

The process followed to select the relevant RIFs started with an initial selection based

on the review of contextual factors identified in three ATC/ATM related human reliability

techniques, namely HERA, TRACEr, and RAFT (Table 7-3). As a result, nine groups of

RIFs have been determined as relevant to ATC: communication, traffic and airspace,

weather, procedures, training and experience, HMI, personal factors, organisational

factors, and task complexity. These initial findings were augmented with a review of

non-ATM related HRA techniques (as presented in the previous sections). Therefore, the

second step involved a review of eleven HRA techniques mostly designed to analyse

human error in the nuclear and process industries. These generated three additional

factors of relevance to controller recovery (see Table 7-3).

Table 7-3 Review of Human Reliability Assessment (HRA) techniques and relevant findings

HRA technique | Industry | Terminology used for contextual factors | Definition of contextual factors | Extracted contextual factors
HERA | ATM | Contextual Conditions (CCs) | Corresponds to the definition in this research | Communication for recovery; Traffic and airspace; Weather; Procedures; Training; HMI; Personal factors; Organisational factors
TRACEr | ATM | Performance Shaping Factors (PSFs) | No definition is provided | as above
RAFT | ATM | as above | No definition is provided | Task complexity
Recovery from failures | Chemical | Recovery Influencing Factors (RIFs) | | Occurrence-related factors (available and applicable defences such as alarm); Group of factors relevant for prioritisation of recovery-related factors (time available/time pressure)
CORE-DATA | Nuclear, chemical, offshore | Performance Shaping Factors (PSFs) | | as above
THERP | Nuclear | Performance Shaping Factors (PSFs) | | Suddenness of occurrence (or time course of failure development)
HEART | Nuclear, chemical, petrochemical, railway, defence | Error Producing Conditions (EPCs) | | as above
COCOM | Generic | Common Performance Modes (CPMs) | More generic definition | as above
CREAM | Generic | Common Performance Conditions (CPCs) | More generic definition | as above
HRMS | Nuclear | Performance Shaping Factors (PSFs) | Additionally include myriad of other factors | as above
ATHEANA | Nuclear | Performance Shaping Factors (PSFs) | Emphasis is placed on purely negative context | as above
CAHR | Nuclear | Performance Shaping Factors (PSFs) | Emphasis is placed on purely negative context | as above
NARA | Nuclear | Error Producing Conditions (EPCs) | | as above
HPDB | Nuclear | Factors | No definition is provided | as above

The assessed HRA techniques and their related factors are presented in tabular form

in Appendix VII. Factors from all techniques are compared to HERA, as the most

recent HRA technique in the ATC/ATM domain. In most cases, the comparison was

straightforward since certain factors were identified in almost all techniques (e.g. the

factor ‘procedures’). However, a number of factors could not be identified as belonging

to any of the HERA categories and were thus categorised separately (shown as

dashed boxes in Appendix VII). Although these did not specifically ‘fit’ any of the HERA

categories, they were retained because of their relevance to the recovery from

equipment failures in ATC. Table 7-3 gives an overview of the RIFs that are taken

forward for further analysis in the next section.



7.2.2 Augmentation with equipment-failure related factors

Once the relevant factors had been determined from the reviewed HRA

techniques (Table 7-3), it was necessary to complement the identified RIFs with

equipment failure related factors. The reason for this is to better reflect the context

surrounding the occurrence of equipment failure and its subsequent controller recovery.

Chapter 4 yielded a further set of recovery factors related to some of the key

characteristics of equipment failures: ATC functionality affected (this is taken into

account separately through the classification of ATC functionalities as defined in

Chapter 2), complexity of failure type, time course of failure development, duration of

failure, impact on operations room (i.e. number of workstations/sectors affected), and

impact on ATC/ATM. As a result, the following RIFs have been added to the previous

list: complexity of failure type, time course of failure development, duration of failure,

and impact on operations room (i.e. number of workstations/sectors affected).

The relevance of the additional equipment-related RIFs has been confirmed in the

analysis of more than 20,000 operational failure reports from four different countries (as

presented in Chapters 3 and 4). However, even the two brief operational reports

given in section 7.1.1 confirmed the relevance of the equipment-related RIFs, namely

number of workstations affected, time course of failure development, and complexity of

failure type.

7.2.3 Augmentation with dynamic situational factors

It was observed that the chosen RIFs represented more static aspects of the working

environment. As observed by Straeter (2005), dynamic situational factors play an

important role in human decision making and behaviour in emergencies (e.g.

unexpected equipment failure). Straeter (2005) identified a total of seven dynamic

situational factors subdivided into time-related and system-related. Time-related

dynamic situational factors are suddenness of onset of a system development,

operational phase of a task, and involvement of the operator. System-related dynamic

situational factors are: experience with system performance (reliance), conflicting

issues in the situation (task complexity), ambiguity of information in the working

environment, and misleading information processing (priming).

Based on the overview of these seven dynamic situational factors, it was possible to

identify three additional factors relevant to the recovery from equipment failures in ATC.

These are: experience with system performance (reliance), ambiguity of information in



the working environment, and adequacy of alarm/alert onset (adapted ‘suddenness of

onset of a system development’ factor). The remaining dynamic situational factors were

either already incorporated amongst candidate RIFs (i.e. task complexity) or were not

considered relevant in the ATM industry (e.g. ‘operational phase of a task’ and

‘misleading information processing’ are more relevant for the non-ATM industries).

7.2.4 Further subdivision of the identified RIFs

In certain cases, the identified recovery factors were too generic to capture the specific

characteristics of the environment at the moment of failure. In order to avoid any

ambiguity, two principles are adopted at this stage of the research. Firstly, each

identified contextual factor is rephrased to better reflect the research presented in this

thesis. For example, ‘communication’ is rephrased to ‘communication for recovery

within team/ATC Centre’. In this way, the selected RIF precisely reflects which segment

of communication is taken into account (i.e. in relation to the recovery process) and

between which parties (i.e. team of controllers or entire ATC Centre). The second

principle represents the subdivision of identified contextual factors whenever necessary

(see Table 7-4). As an example, the ‘traffic and airspace’ factor is too generic to

capture the characteristics of both traffic and airspace and was therefore broken down

into two separate categories. A similar approach is applied to ‘training and experience’.

Table 7-4 Recovery Influencing Factors

Identified contextual factors | Corresponding Recovery Influencing Factors (RIFs)
Communication | Communication for recovery within team/ATC Centre
Traffic and airspace | Traffic complexity during the recovery process; Airspace characteristics during the recovery process
Weather | Weather conditions during the recovery process
Procedures | Existence of recovery procedure
Training and experience | Training for recovery from ATC equipment failures; Experience with equipment failures
HMI | Adequacy of HMI and operational support
Personal factors | Personal factors
Organisational factors | Adequacy of organisation
Task complexity | Conflicting issues in the situation (task complexity)
Time available & time pressure | Time necessary to recover
Available and applicable defences and barriers & alarms | Adequacy of alarms/alerts (as part of HMI)
Complexity of failure | Complexity of failure type
Suddenness of occurrence & Time course of failure development | Time course of failure development
Duration of failure type | Duration of failure
Impact on operational room (i.e. number of workstations/sectors affected) | Number of workstations/sectors affected
Experience with system performance (reliance) | Experience with system performance (reliance or trust in the system)
Ambiguity of information in the working environment |
Adequacy of alarm/alert onset | Adequacy of alarm onset
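
The second principle (subdivision of a contextual factor into several RIFs) can be made explicit by holding part of Table 7-4 as a simple mapping. This is only an illustrative sketch: the entries are copied from the first rows of the table, while the dictionary layout and the function name are assumptions introduced here.

```python
# First rows of Table 7-4: contextual factor -> corresponding RIF(s).
# The data structure itself is an illustrative assumption.
FACTOR_TO_RIFS = {
    "Communication": ["Communication for recovery within team/ATC Centre"],
    "Traffic and airspace": [
        "Traffic complexity during the recovery process",
        "Airspace characteristics during the recovery process",
    ],
    "Weather": ["Weather conditions during the recovery process"],
    "Procedures": ["Existence of recovery procedure"],
    "Training and experience": [
        "Training for recovery from ATC equipment failures",
        "Experience with equipment failures",
    ],
    "HMI": ["Adequacy of HMI and operational support"],
}


def subdivided_factors(mapping):
    """Return the contextual factors that were split into more than one RIF."""
    return [factor for factor, rifs in mapping.items() if len(rifs) > 1]


print(subdivided_factors(FACTOR_TO_RIFS))
```

For the rows shown, the function returns the two factors named in the text as having been broken down: ‘traffic and airspace’ and ‘training and experience’.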

7.3 Definition of qualitative descriptors

The final step involves the definition of the qualitative descriptors for each RIF. In this

research, a qualitative descriptor defines the levels of impact that each RIF has in the

context of controller recovery performance. The simplest case would be a dichotomous

descriptor distinguishing only two levels of impact of each recovery factor. However,

this approach often loses valuable information and is not always suitable.

Therefore, qualitative descriptors have been constructed providing three levels of

impact. The levels run from Level 1, referring to the most desirable level (in terms of ATC

recovery), toward Level 2, referring to the tolerable or average level, and finishing with

Level 3, referring to the least desirable level. For example, the RIF ‘communication for

recovery within team/ATC Centre’ would have three qualitative descriptors, namely

‘efficient communication’, ‘tolerable communication’, and ‘inefficient communication’.

This approach is similar to that taken in the CREAM technique (Hollnagel, 1998;

section 7.2.1.9).

On the other hand, the RIF ‘Experience with the system performance (reliance or trust

in the system)’ would have two qualitative descriptors. The first would be ‘objective

attitude toward the system’. The second would account for inadequate attitude of the

controller toward the ATC system and would include both ‘positive experience with the

system (overtrust) and negative experience with the system (undertrust)’. In order to

accurately present the levels of impact that this particular RIF has in the context of

controller recovery performance, it was necessary to combine the cases of undertrust

and overtrust in the ATC system. To all intents and purposes, they both have a similar,

undesirable effect on controller recovery performance. Undertrust in ATC systems

leads to inefficient use of available equipment or all of the available tools. On the other

hand, overtrust leads to complete reliance on the information provided by the system

without consideration of the controller’s own judgement or situational awareness of the

position (lateral and longitudinal) and intent of the traffic within a dedicated airspace.

The above analyses led to a final set of 20 controller Recovery Influencing Factors

(RIFs) divided into four main groups: internal factors (i.e. factors related to the

controller), equipment failure related factors, external factors (i.e. factors related to

working conditions), and airspace related factors. Finally, it has to be noted that the



definition of these 20 RIFs assumes that an equipment failure has occurred (i.e.

probability of equipment failure is 1). Otherwise, these 20 RIFs would have to be re-

named and re-defined to allow an analysis of the context surrounding a particular event

under investigation, no longer being an equipment failure. Table 7-5 presents the final

set of factors relevant to the recovery from equipment failures in ATC, together with

their corresponding qualitative descriptors. It has to be noted that these 20 RIFs

represent high-level categories (e.g. personal factors) consisting of several low-level

factors (e.g. age, experience, stress, fatigue). The detailed definitions of these 20 RIFs

in this thesis are presented in Appendix VIII.

Table 7-5 Relevant recovery influencing factors and their corresponding qualitative descriptors

RIF name | Qualitative descriptor | Level

Internal factors
Training for recovery from ATC equipment failure | Suitable to the situation in question | 1
Training for recovery from ATC equipment failure | Tolerable to the situation in question | 2
Training for recovery from ATC equipment failure | Counter productive to the situation in question | 3
Experience with equipment failures | Experienced a particular type of failure or any other type of ATC equipment failure | 1
Experience with equipment failures | No experience with ATC equipment failures | 2
Experience with the system performance (reliance) | Objective attitude toward the system | 2
Experience with the system performance (reliance) | Positive experience with the system or negative experience with the system | 3
Personal factors | Suitable for the recovery process | 1
Personal factors | Tolerable for the recovery process | 2
Personal factors | Counter productive for the recovery process | 3
Communication for recovery within team/ATC Centre | Efficient | 1
Communication for recovery within team/ATC Centre | Tolerable | 2
Communication for recovery within team/ATC Centre | Inefficient | 3

Equipment failure related factors
Complexity of failure type | Single system affected | 2
Complexity of failure type | Multiple systems affected | 3
Time course of failure development | Sudden failure | 1
Time course of failure development | Persistent or latent failure | 2
Time course of failure development | Gradual degradation of system | 3
Number of workstations/sectors affected | One workstation/one sector or all workstations in one sector | 2
Number of workstations/sectors affected | Several workstations/couple of sectors or all workstations/all sectors | 3
Time necessary to recover | Adequate | 1
Time necessary to recover | Inadequate | 3
Existence of recovery procedure | Inappropriate | 3
Duration of failure | Short period of time | 2
Duration of failure | Moderate or substantial period of time | 3

External factors or factors related to working conditions
Adequacy of HMI and operational support | Counter productive to the situation in question | 3
Ambiguity of information in the working environment | External working environment matches the controller's internal mental model | 1
Ambiguity of information in the working environment | External working environment mismatches the controller's internal mental model | 3
Adequacy of alarms/alerts | | 3
Adequacy of alarm/alert onset | Information from the external world enters the processing loop at the right time | 1
Adequacy of alarm/alert onset | Information from the external world enters the processing loop at the wrong time (misleading sequence of alarms) | 3
Adequacy of organisation | Efficient | 1
Adequacy of organisation | Tolerable | 2
Adequacy of organisation | Inefficient | 3

Airspace related factors
Traffic complexity during the recovery process | Average traffic complexity | 2
Traffic complexity during the recovery process | High or low traffic complexity | 3
Airspace characteristics during the recovery process | Adequate | 1
Airspace characteristics during the recovery process | Tolerable | 2
Airspace characteristics during the recovery process | Inappropriate | 3
Weather conditions during the recovery process | Improved | 2
Weather conditions during the recovery process | Deteriorated | 3
Conflicting issues in the situation (task complexity) | Average complexity of the situation | 2
Conflicting issues in the situation (task complexity) | Conflicting, multiple tasks or extremely low complexity of the situation | 3
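
Table 7-5 shows that a RIF does not always use all three levels: some RIFs are defined only for Levels 2 and 3, or only for Levels 1 and 3. A minimal sketch of how such descriptors might be stored and looked up is given below. The three RIFs and their descriptors are copied from the table; the dictionary layout and the function name are assumptions made for illustration.

```python
# Qualitative descriptors keyed by level, copied from three rows of Table 7-5.
# The layout is an illustrative assumption, not the thesis implementation.
RIF_DESCRIPTORS = {
    "Communication for recovery within team/ATC Centre": {
        1: "Efficient",
        2: "Tolerable",
        3: "Inefficient",
    },
    "Complexity of failure type": {
        2: "Single system affected",
        3: "Multiple systems affected",
    },
    "Time necessary to recover": {
        1: "Adequate",
        3: "Inadequate",
    },
}


def describe(rif, level):
    """Return the qualitative descriptor for a RIF at the given level."""
    levels = RIF_DESCRIPTORS[rif]
    if level not in levels:
        raise ValueError(
            f"{rif!r} has no Level {level}; valid levels: {sorted(levels)}"
        )
    return levels[level]


print(describe("Complexity of failure type", 3))
```

Keying the descriptors by level number, rather than by list position, keeps a two-level RIF from being silently read as if it had a Level 1 or Level 2 that the table does not define.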

In order to ensure a complete list of relevant contextual factors, a key step at this stage

included verification of the selected RIFs. An initial verification was provided by two

ATM specialists (from one European ATC Centre) with extensive operational

experience. They had an opportunity to review the candidate RIFs, their definitions,

and related qualitative descriptors (for evidence see Appendix II) and their feedback

was valuable in the approval of selected RIFs. Further verification of the selected RIFs

has been conducted in the experiment (presented in Chapters 9 and 10). A discussion

on the process to quantify the probabilistic definition of the 20 RIFs, their interactions, and

their influence on controller recovery is presented in more detail in the following

Chapter.

7.4 Summary

The objective of this Chapter was to define the recovery context via a set of contextual

factors, known as ‘Recovery Influencing Factors’ or RIFs. The Chapter has built on the

review of existing HRA techniques and their corresponding contextual factors to identify

which factors are relevant to recovery from equipment failure in ATC. This initial

selection of relevant contextual factors has been augmented with specific equipment



failure related factors and dynamic situational factors. The methodology resulted in a

set of 20 controller RIFs. The Chapter concludes with a definition of the qualitative

descriptors for each RIF or the levels of impact that each RIF has in the context of

controller recovery performance. All results obtained have been initially verified by two

ATM specialists who reviewed the choice of selected RIFs and their qualitative

descriptors. The selection of relevant contextual factors (i.e. RIFs) and their qualitative

descriptors are taken forward to the next Chapter to develop the methodology for the

quantitative assessment of the recovery context.

Chapter 8 Quantitative Assessment of Recovery Context


8 Quantitative Assessment of Air Traffic Controller Recovery Context

The previous Chapter presented a selection of contextual factors relevant to recovery

from equipment failures in Air Traffic Control (ATC), known as Recovery Influencing

Factors (RIFs). This selection was based on a review of existing Human Reliability

Assessment (HRA) techniques, augmented by specific equipment failure and dynamic

situational factors. A set of 20 RIFs were identified and distributed in four main groups:

internal, equipment failure related, external, and airspace related factors. In order to

facilitate quantitative assessment of the recovery context, the selected RIFs were firstly

assigned potential qualitative levels of impact followed by their quantitative definition

(i.e. probability of each level occurring). The Chapter starts by reviewing relevant past

research to formulate the methodology adopted in this thesis. The proposed

methodology consists of six steps. The qualitative definition of 20 RIFs from the

previous Chapter (Step 1) is followed by the quantitative definition of each RIF (Step 2).

This quantitative definition is based on various sources, such as past literature,

operational failure reports, expert input of eight ATM specialists, and the questionnaire

survey. The Chapter continues with the implementation of all existing interactions

between relevant RIFs (Step 3). These are identified by utilising operational experience

and further validated by past research and expert input. Incorporation of interactions

results in changes to RIF levels, which necessitates determination of the cut-off point

between any two consecutive levels (Step 4). Finally, the methodology defines the

relationship between a particular RIF level and its effect on controller recovery

performance (Step 5), to conclude with the definition of a numerical indicator for each

recovery context (Step 6).

8.1 Lessons learnt from past research

The review of various HRA techniques (in Chapter 7) identified two issues relevant to

this thesis. Firstly, it identified potential RIFs. Secondly, it revealed the two HRA

techniques which use contextual factors as the basis for quantitative human

performance analysis. These are: the Cognitive Reliability and Error Analysis Method -



CREAM (Hollnagel, 1998) and Connectionism Assessment of Human Reliability -

CAHR (Straeter, 2000). A discussion of the CREAM technique and its relevance to

this thesis is presented in sections 7.2.1.9 and 7.3 of Chapter 7 and will not be

repeated here. However, since the CREAM technique has been further developed in

the work by Kim, Seong, and Hollnagel (2005) and Fujita and Hollnagel (2004), both

approaches have been assessed for their relevance to the research presented in this

thesis.

8.1.1 Applications of the CREAM technique

The application of the CREAM technique by Kim, Seong, and Hollnagel (2005)

attempted a probabilistic determination of contextual factors to determine the relevant

control mode (tactical, opportunistic, scrambled, and strategic control as defined in

CREAM). In short, the authors proposed probability distributions for nine contextual

factors or CPCs, taking into account their dependencies. The advantage of their

approach is the straightforward incorporation of uncertainties. In other words, this

approach is useful in the case of contextual factors which are not clearly defined or

understood. Because of this particular feature, this approach has been adopted in this

thesis.

Furthermore, Kim, Seong, and Hollnagel (2005) link each level of a contextual factor to

a specific type of control and assess all possible contexts using the Bayesian Belief

Network (BBN) approach. Littlewood, Strigini, Wright, and Courtois (1998) state that

the use of BBNs allows safety experts to better handle safety assessment and

potentially make hidden safety arguments more visible, communicable, and auditable.

In general, the concept of BBN is based on a probabilistic approach. It combines expert

input and data, and is useful for building complex and uncertain applications. However,

the BBN approach by Kim et al. (2005) based on nine CPCs proved too complex.

Subsequently, Kim et al. simplified it by grouping the nine CPCs into groups of

three, which were then assessed by the BBN approach. Since a full BBN was already

impractical for nine CPCs, a probabilistic approach

based upon code written in C and the core methodology by Kim et al. (2005) is

used in this thesis to enable incorporation of all 20 RIFs.
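
The core idea of assigning a probability distribution to the levels of each contextual factor and then assessing all possible contexts can be sketched as follows. The sketch is written in Python rather than C for brevity. The probability values are invented, only three RIFs are used, and the RIFs are treated as independent; the latter is a deliberate simplification, since the methodology described in this Chapter incorporates interactions between RIFs.

```python
from itertools import product
from math import prod

# Hypothetical level probabilities for three RIFs (levels as in Table 7-5).
# The numbers are invented for illustration only.
RIF_LEVEL_PROBS = {
    "Communication for recovery within team/ATC Centre": {1: 0.6, 2: 0.3, 3: 0.1},
    "Complexity of failure type": {2: 0.8, 3: 0.2},
    "Adequacy of organisation": {1: 0.5, 2: 0.4, 3: 0.1},
}


def enumerate_contexts(level_probs):
    """Yield (context, probability) for every combination of RIF levels.

    Assumes the RIFs are independent, so the probability of a context is
    the product of the probabilities of its individual levels.
    """
    names = list(level_probs)
    for levels in product(*(level_probs[name] for name in names)):
        probability = prod(level_probs[n][lvl] for n, lvl in zip(names, levels))
        yield dict(zip(names, levels)), probability


contexts = list(enumerate_contexts(RIF_LEVEL_PROBS))
print(len(contexts))  # 3 * 2 * 3 = 18 contexts
most_likely = max(contexts, key=lambda item: item[1])
print(most_likely)
```

The number of contexts is the product of the number of levels of each factor, so it grows multiplicatively as factors are added. This makes exhaustive enumeration by a program a natural fit once the set of factors is extended to 20 RIFs.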

The application of the CREAM technique by Fujita and Hollnagel (2004) is designed as

a practical application of CREAM for screening various scenarios and estimating the

failure probability solely from the characteristics of the contextual conditions

surrounding an occurrence (e.g. accident). In this way, the method moves away from

the notion of human error and focuses more on context as a driving force of inadequate



human performance, regardless of whether an individual or a team is involved.

Although it demonstrates the usefulness of the CREAM methodology, this method is

not very relevant to this thesis.

8.1.2 Connectionism Assessment of Human Reliability (CAHR)

As previously discussed in section 7.2.1.12 of Chapter 7, CAHR is a data-driven HRA

technique based on highly detailed databases of incident reports in the nuclear industry.

Using the available incident reports, it was possible to move away from an expert

judgment based categorisation of PSFs towards a more analytical method. However,

ATC still lacks a high-level database that captures human performance in the event of

an ATC related incident/accident. Therefore, an analysis of context as performed in

CAHR is still not achievable in the ATC industry. Some initial attempts to establish a

database that captures the human performance data are planned by EUROCONTROL

through the Human Error in ATM (HERA) project (EUROCONTROL, 2002d), but

currently this is incapable of supporting any meaningful statistical analysis.

Table 8-1 summarises the characteristics of CREAM, its two main

applications, and CAHR. Section 8.2 builds on the relevant elements of the CREAM

technique to define a framework for the quantitative assessment of recovery context.

Table 8-1 Overview of CREAM and CAHR differences

HRA technique | Relevant area | Number of contextual factors | Interaction between contextual factors | Output
CREAM by Hollnagel (1998) | Theoretical approach toward human erroneous action | Nine | Included qualitatively | Quantitative probabilistic range
Improvement of CREAM by Fujita and Hollnagel (2004) | Theoretical approach toward ‘action’ failure rate based on contextual factors | Ten | Included qualitatively (based on CREAM) | Quantitative mean failure rate
Improvement of CREAM by Kim, Seong, and Hollnagel (2005) | Theoretical approach toward human erroneous action | Nine | Included qualitatively (based on CREAM) | Quantitative, probabilistic approach
CAHR by Straeter (2000) | Data driven approach defined within nuclear industry | Thirty | Included quantitatively using the available data | Connectionism method facilitating qualitative and quantitative approach



8.2 Framework of the methodology for a quantitative assessment of recovery context

The proposed methodology is ‘generic’ as its aim is to present the framework for a

‘generic’ ATC Centre, as described in Chapter 2, section 2.4. Used operationally, this

methodology would have to be refined to reflect and incorporate all the characteristics

of the ATC Centre or event under investigation.

In general this methodology consists of six steps (Figure 8-1). Firstly, it is necessary to

review the twenty RIFs identified in the previous Chapter and their relevance to the

ATC Centre or event under investigation. In the ‘generic’ approach, all 20 factors are

assessed and defined through their qualitative descriptor or their levels of impact on

controller recovery performance (Step 1). Secondly, based on available sources of

information each RIF is probabilistically defined (Step 2). As a result, it is possible to

present the recovery context as a function of identified RIFs and their corresponding

levels. At this stage, there is no consideration of the interactions between RIFs, as they

are considered to be independent. To provide an accurate approach, Step 3 takes into

account all interactions between RIFs. These are assessed both qualitatively and

quantitatively. This results in a distribution of RIF levels. Having a distribution of RIF

levels, as opposed to discrete Levels 1, 2 and 3, necessitates identification of the cut-

off point between any two consecutive levels (Step 4). Once these cut-off points are

identified and RIF levels re-defined, the next step quantifies the relationship between

the particular level of RIF and its impact on controller recovery performance. This

relationship is expressed via correlation coefficients (Step 5). At this stage, previously

determined probabilities of each RIF level (Step 2) are re-calculated to account for

RIF interactions. The result is the definition of an aggregated indicator of the recovery

context, referred to as the recovery context indicator – Ic (Step 6).

Figure 8-1 below presents the six-step framework for the quantitative assessment

of the recovery context. Since the previous Chapter identified and discussed all 20

RIFs and their levels of impact (qualitative descriptor), the following section discusses

the consequent step, namely probabilistic assessment of RIFs (Step 2). This is

followed by the remaining steps of the proposed methodology (Figure 8-1).



Figure 8-1 Framework for the quantitative assessment of the recovery context



8.3 Probabilistic assessment of RIFs (Step 2)

Given that the aim of this Chapter is to present a reliable quantitative approach for the

analysis of the controller recovery performance, it is necessary to probabilistically

define levels of influence of each RIF on controller performance (referred to as

qualitative descriptor). As previously discussed in Chapter 7 (section 7.3), the

qualitative and quantitative definition of RIFs assumes that a failure occurred (i.e. that

the probability of failure is 1). In this way, it is possible to define every possible context

as a combination of RIFs and their corresponding levels of influence, i.e. qualitative

descriptor. This approach is important for the prospective analysis of controller

performance, as well as a retrospective event analysis. Even in the case of

retrospective analysis, specifying RIFs exactly is not straightforward due to the lack of

data and information about the context. In the case of predicting future events or

potential hazardous contexts, specifying the RIFs accurately becomes much more

difficult and a level of uncertainty is inherent in the process.

The use of a probabilistic approach has several advantages. Firstly, if a certain RIF is

not clearly specified or known, it is possible to assume probabilities for each of its

levels based on operational data. In this way any uncertainties identified for a certain

RIF can be considered more explicitly as illustrated by Kim, Seong, and Hollnagel

(2005). Another advantage of this approach is that the probability distribution of the

context, and indirectly controller performance, is a result of considering all possible

combinations of contextual factors or RIFs.
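
The idea of a probability distribution over contexts can be sketched in a few lines. This is a minimal illustration, assuming that RIFs are treated as independent (as in Step 2), so that the probability of a context is the product of the probabilities of its RIF levels. The three RIF names and all probability values are hypothetical placeholders, not the values given in Appendix VIII.

```python
from itertools import product
from math import prod, isclose

# Hypothetical level probabilities for three illustrative RIFs
# (keys are level numbers; each distribution sums to 1).
rif_levels = {
    "RIF_a": {1: 0.6, 2: 0.3, 3: 0.1},
    "RIF_b": {1: 0.8, 2: 0.2},
    "RIF_c": {1: 0.5, 2: 0.5},
}

def context_probability(context, rif_levels):
    """Probability of one context, assuming independent RIFs."""
    return prod(rif_levels[name][level] for name, level in context.items())

def all_contexts(rif_levels):
    """Yield every combination of RIF levels as a dict {RIF name: level}."""
    names = list(rif_levels)
    for levels in product(*(rif_levels[n] for n in names)):
        yield dict(zip(names, levels))

contexts = list(all_contexts(rif_levels))
total = sum(context_probability(c, rif_levels) for c in contexts)
```

Because every combination of levels is enumerated, the context probabilities sum to one, which is a quick consistency check on the assigned RIF probabilities.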

The definition of each RIF in terms of the probability of each of its levels is not

straightforward. However, this is necessary for any attempt to quantify the

effectiveness of controller recovery performance in a given context or environment.

Major difficulties are experienced in the quantification of internal RIFs (or factors

related to the controller), as it is hard to quantify any type of human performance. It is

also difficult to quantify some of the equipment failure related RIFs due to the lack of

consistent data collection in the available occurrence reporting schemes. In other

words, some failure characteristics, such as the number of workstations affected, are

not consistently reported. Finally, the majority of the external RIFs are highly ATC

Centre specific and as such extremely hard to define in a generic form. Bearing this in

mind, it is understandable why the quantification of RIFs has been a challenge in the

past.



For this reason, it should be noted that this Chapter captures the characteristics of the

‘generic’ ATC Centre as a base for any further fine tuning of the proposed methodology

and its usage as either a retrospective or prospective/predictive tool. Each ATC Centre

has its unique characteristics that may be represented by different RIF probabilities.

For example, the ‘number of workstations/sectors affected’ and ‘complexity of failure

type’ depend on a particular architecture in each ATC Centre, while ‘training for

recovery’ as well as ‘adequacy of organisation’ depend on a particular safety culture.

The framework developed in this Chapter is applied to a unique ATC Centre, presented

in Chapter 10.

8.3.1 Sources of information

A total of four different sources of information have been consulted in order to

determine the necessary RIF probabilities. These are: operational failure reports

(presented in Chapter 4), the responses from the questionnaire survey (presented in

Chapter 6), responses of ATM specialists, and past literature. Table 8-2 presents the

number of RIFs defined by each available source of information, while the following

paragraphs explain each source in detail. However, two RIFs are not informed by any

of the available sources (‘number of workstations/sectors affected’ and ‘adequacy of

alarm/alert onset’). In these cases, a conservative approach is taken and probabilities

are equally assigned between their levels. Details are presented in Appendix VIII.

Furthermore, three RIFs are informed by combined sources of information (last column

in Table 8-2).

Table 8-2 Distribution of probabilistic RIF ratings per source

Source of probabilistic assessment | Number of RIFs assessed directly (single source) | Number of RIFs assessed indirectly (combined sources)
Operational failure reports | - | 1 (RIF11), 1 (RIF6)
Questionnaire survey | 3 | -
Averaged ATM specialists input | 12 | 1 (RIF11), 1 (RIF3), 1 (RIF6)
Past literature | - | 1 (RIF3)
No available source | 2 | -
Sum | 17 | 3 (i.e. RIF3, RIF6, and RIF11)

8.3.1.1 Operational failure reports

The probabilistic assessment of the recovery factors is informed by the analysis of

more than 20,000 operational failure reports on equipment failures originating from

three Civil Aviation Authorities (referred to as Countries A, B, and C) and one ATC



Centre system control and monitoring database (referred to as Country D). Detailed

analyses of these reports are presented in Chapter 4.

The analyses of operational failure reports are used to inform two particular RIF

probabilities. The first one is ‘complexity of failure type’. The probabilities relevant to

this RIF are determined by comparing the number of reports describing a single failure

with the number reporting more than one failure. These findings are further validated

by the responses from the eight ATM specialists surveyed. The second RIF is ‘duration

of failure’. This RIF is informed by the analysis of data from the Country D database, as it

was the only database that captured duration of failure. These findings are further

validated by the responses from the eight ATM specialists surveyed.

8.3.1.2 Questionnaire survey

The responses from the questionnaire survey, received from 34 different countries,

captured the experiences of more than one hundred air traffic controllers (average

controller experience is 13.8 years, ranging from 1 to 39 years). The detailed

assessment of this dataset is presented in Chapter 6. This source provided an input for

three RIF probabilities. These are: ‘training for recovery from ATC equipment failure’,

‘previous experience with a particular type of equipment failure’, and ‘existence of

recovery procedure’.

The first RIF (‘training for recovery from ATC equipment failure’) is more difficult to

determine than the other two RIFs. The questionnaire survey determined that 51.7

percent of sampled ATC Centres have established training for recovery (informed

probability of RIF1 defined via Level 1) and that 31 percent have not (informed

probability of RIF1 defined via Level 3). The remaining 17.4 percent of sampled ATC

Centres showed inconsistent responses and this result is translated into the probability

of RIF1 defined via Level 2 or the ‘tolerable’ level. It is assumed that inconsistent

responses on the existence of recovery training, within the same ATC Centre, may

suggest that training is not organised in a consistent manner.
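
The translation of survey percentages into level probabilities can be sketched as follows. This is an illustration only: the normalisation step is an assumption introduced here to absorb rounding (the three quoted percentages add up to 100.1), and the two-level uniform example is a hypothetical stand-in for a RIF with no informing source, for which probabilities are assigned equally between levels.

```python
from math import isclose

def normalise(percentages):
    """Turn reported percentages per level into probabilities that sum to 1."""
    total = sum(percentages.values())
    if total <= 0:
        raise ValueError("percentages must have a positive sum")
    return {level: value / total for level, value in percentages.items()}

def uniform(n_levels):
    """Equal probabilities for a RIF with no informing source."""
    return {level: 1.0 / n_levels for level in range(1, n_levels + 1)}

# Survey percentages for RIF1 as quoted in the text (Levels 1, 2, 3).
rif1_percent = {1: 51.7, 2: 17.4, 3: 31.0}
rif1 = normalise(rif1_percent)

# Hypothetical two-level RIF for which no source of information exists.
uninformed = uniform(2)
```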

8.3.1.3 Input by ATM specialists

Several probabilities are captured through the input from relevant ATM specialists from

eight similar ATC Centres. The ATM specialists from Ireland, Norway, Sweden, Austria,

New Zealand, Australia, and Japan participated in the small-scale survey. In two cases

the relevant probabilities are captured through face-to-face interviews (with ATM

specialists from Ireland and Norway), whilst in all other cases a predefined set of



questions was distributed for self-completion. These questions were designed to

investigate the factors that impact on controller recovery (as defined via 20 RIFs). For

example, their input informed the probabilities which could not be captured using other

sources of information either because of their confidential nature (e.g. ‘time course of

failure development’) or because of the general unavailability of data (‘adequacy of HMI

and operational support’, ‘adequacy of organisation’). The form used with both face-to-

face interviews and self-completion methods of response collection is available in

Appendix IX.

The ATM specialists surveyed have wide ATM operational experience and worked as

either rated air traffic controllers or as engineers in the operational ATM environment.

However, their resident ATC Centres needed to be assessed to establish the level of

similarity that may be reflected in their RIF ratings (Table 8-3). All eight ATC Centres

provide Area Control Service (ACC) while some also provide oceanic air traffic services,

i.e. control of traffic transiting oceanic areas where the absence of radar coverage

necessitates the use of procedural control. Furthermore, six ATC Centres are equipped

with advanced ATC systems, utilising the latest automated tools such as Short Term

Conflict Alert (STCA), Area Proximity Warning (APW), and Minimum Safe Altitude

Warning (MSAW). Finally, although the traffic is reported at the country level, all ATC

Centres provide the majority of ACC services in their respective countries. For this

reason, country-level traffic figures can be taken as a good indicator of the amount of

traffic controlled by each respective ATC Centre. Reviewing the available traffic figures,

only Japan differs significantly compared to other countries. The Tokyo area represents

one of the busiest airspaces in the world, comparable to the London and Maastricht

areas of Europe.

Table 8-3 ATM specialists involved in the assessment of RIFs

Resident ATC Centre | ATC service provided | ATC system status [1] | Total IFR flights controlled within the country in 2005 (in thousands)
Shannon | ACC/Oceanic | Latest generation | 621 [2]
Oslo | ACC | Latest generation | 488 [2]
Malmo | ACC | Latest generation | 686 [2]
Vienna | ACC | Older generation | 819 [2]
Auckland | ACC/Oceanic | Latest generation | 555 [3]
Melbourne | ACC/Oceanic | Latest generation | 647 [4]
Christchurch | ACC/Oceanic | Latest generation | 555 [3]
Tokyo | ACC/Oceanic | Older generation | 2,250 [5]

[1] Source: personal correspondence with Dr Arnab Majumdar who visited all listed ATC Centres
[2] Source: EUROCONTROL Performance Review Report (EUROCONTROL, 2006c)
[3] Source: Airways New Zealand (2006b)
[4] Source: Bureau of Transport and Regional Economics (2006). Australian Government

The responses from the ATM specialists surveyed are used to inform 12 RIFs. For

three RIFs their responses have been used to either supplement the findings from the

past research (for the ‘experience with the system performance’ RIF) or validate

findings from the operational failure reports (for the ‘complexity of failure type’ and

‘duration of failure’ RIFs).

For the majority of RIFs, the responses from the ATM specialists surveyed have been

consistent. However, for six RIFs some ATM specialists gave different answers. This

was the case with the following RIFs: ‘personal factors’, ‘communication for recovery

within team/ATC Centre’, ‘time course of failure development’, ‘adequacy of HMI and

operational support’, ‘airspace characteristics’, and ‘conflicting issues in the situation

(task complexity)’. For example, for ‘personal factors’ the majority of ATM specialists

reported this RIF as ‘suitable for the recovery process’ in 70 to 90 percent of failure

occurrences. However, Oslo and Tokyo ATM specialists reported personal factors as

‘suitable’ in less than 15 percent of failure occurrences. These lower ratings of the

‘personal factors’ indicate the perception of ATM specialists on readiness of air traffic

controllers to face unusual/emergency situations, such as equipment failure.

Similarly, potential gaps are identified with Melbourne and Christchurch ATC Centres

where the majority of failures seem to be latent (accounted for 92 and 60 percent,

respectively). This is contrary to the answers provided from other ATC Centres. Finally,

the potential gaps regarding the ‘adequacy of airspace’ are identified by ATM

specialists from Auckland and Tokyo ATC Centres. They ranked airspace design and

configuration as tolerable, highlighting the potential for improvement of airspace

characteristics to enhance controller recovery performance.

It can be concluded that the ATM specialists from eight ATC Centres worldwide produced

similar ratings for the majority of RIFs. Identified inconsistencies reflect differences that

exist between these ATC Centres in terms of the ATC Centre culture (reflected in

personal factors), airspace design, and ATC Centre architecture. These differences are

reasonable as indicators of the diversity that exists between ATC Centres within one

country as well as worldwide. As a result, the responses from the ATM specialists

surveyed have been taken to inform several RIFs. In future, a weighting scheme may

be used to account for the variability between ATC Centres (e.g. safety culture,

differences between ATC Centres, ATM specialists' experience).

[5] Source: Air Traffic Activity at Area Control Centre (last available for 2003) from Ministry of Land, Infrastructure, and Transport (2006)
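
The averaging of specialist input, and the weighting scheme suggested for future work, can be sketched as a weighted mean. The ratings below are hypothetical values chosen only to resemble the ‘personal factors’ pattern described above (six ratings between 70 and 90 percent and two below 15 percent), and the weights are an arbitrary illustration rather than a proposed scheme.

```python
def average_rating(ratings, weights=None):
    """Weighted mean of specialist ratings; equal weights when none are given."""
    if weights is None:
        weights = [1.0] * len(ratings)
    if len(weights) != len(ratings):
        raise ValueError("ratings and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(r * w for r, w in zip(ratings, weights)) / total_weight

# Hypothetical share of occurrences in which 'personal factors' were rated
# 'suitable', one value per ATC Centre (six high ratings, two low ones).
ratings = [0.80, 0.90, 0.70, 0.85, 0.75, 0.80, 0.10, 0.12]

plain_mean = average_rating(ratings)
# Hypothetical weights that halve the influence of the last two Centres.
weighted_mean = average_rating(ratings, [1, 1, 1, 1, 1, 1, 0.5, 0.5])
```

With equal weights the two low ratings pull the mean well below the majority view; any weighting scheme would need its own justification, for example in terms of traffic levels or specialist experience.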

8.3.1.4 Past literature

Finally, the relevant data from past ATC research are used to inform probabilities for

the RIF ‘experience with the system performance’. The probabilities are determined

from the findings of Hilburn and Flynn (2001) and EUROCONTROL (2000b) in which

18 percent of controllers reported undertrust in technology. These findings are

combined with the responses from the ATM specialists surveyed on the percentage of

controllers with an excessive trust in technology (i.e. overtrust). Therefore, both

sources of information are used to establish the final probability rating for this particular

RIF (presented in Appendix VIII).

8.3.1.5 Aggregation of data

The previous sections have described four different sources of information used to

determine RIF probabilities. These are: operational failure reports, responses from a

questionnaire survey, responses from the ATM specialists surveyed, and past literature.

Table 8-4 reviews all four sources of information with respect to the level of confidence

and therefore the rationale behind the aggregation of data. Three data sources are

rated with a high level of confidence (questionnaire survey, responses from the ATM

specialists surveyed, and past literature). Only one source is rated with medium

confidence. More precisely, the confidence level for operational failure reports from the

CAA databases is not defined as ‘high’ due to the lack of information on the reliability of

available reporting schemes. There are reliability issues regarding the reporting of

safety occurrences recognised by CAAs (see footnote 6). However, none of the CAAs has a

methodology in place to assess the reliability of their reporting scheme, and therefore,

the completeness of the occurrence databases. Therefore, the medium ranking for the

confidence level is an assumption informed by operational experience. As a result, the

data from this source are validated by the findings from another source of data (i.e.

ATM specialists input) to assure reliable RIF ratings.

6 International workshop on the analysis of aviation incident/accident precursors. The workshop

was held on 25 and 26 May 2005 at Imperial College London.



Table 8-4 Overview of the sources of information used to determine RIF probabilities

Source | Level of confidence (subjective) | Comment
Operational failure reports from the CAAs | Medium | The confidence level is not defined as ‘high’ due to the lack of information on reliability of available reporting schemes
Operational failure reports from the engineering unit of particular ANSP | High | The confidence level is defined as ‘high’ due to the fact that the engineering unit has to be aware of all equipment failures occurring in the ATC Centre as they are directly responsible for their maintenance and repair
Questionnaire survey | High | Responses from 134 air traffic controllers, from 58 ATC Centres, and 34 countries worldwide
ATM specialists | High | Conducted with ATC specialists from eight ATC Centres worldwide
Past literature | High | Hilburn and Flynn (2001) and EUROCONTROL (2000b)

In general, the above analyses employed the data from all four sources to define the

probabilities for 20 Recovery Influencing Factors (RIFs). These are presented in

Appendix VIII.

8.3.2 Summary

The preceding paragraphs have used the qualitative levels of the impact of each of the

RIFs (i.e. qualitative descriptor) defined in Chapter 7 and probabilistically defined each.

An overview of all 20 RIFs, their corresponding levels, and designated probabilities is

provided in detail in Appendix VIII and in a tabular form in Appendix X.

Having defined all 20 relevant recovery factors in the previous sections, it is possible to

define recovery context. In general the recovery context may be seen as a discrete

function since all possible contexts are defined exactly by 20 elements, and since each

RIF has only two or three defined levels. In mathematical terms, the existing method

can be expressed as a function f using a set of 20 RIFs to define the recovery context

indicator (Ic) as shown in equation 8-1:

$I_c = f(RIF_1, RIF_2, \ldots, RIF_{20})$    (8-1)

The total number of possible recovery contexts represents the number of combinations

of the 20 RIFs, where nine of them have three levels whilst eleven have only two levels

of impact. In total, this approach generates 3^9 x 2^11 = 40,310,784 possible contexts,

each having equal probability of occurrence of 1/40,310,784 ≈ 2.48E-08. In

mathematical terms this is equivalent to finding all variations with repetition of 20 RIFs



and their corresponding levels. In addition, each recovery context will have a specific

value of the recovery context indicator (Ic). The methodology to calculate this variable

is presented in the remainder of this Chapter.
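
The size of the context space can be checked with a short calculation. The sketch below assumes only what is stated above, namely nine RIFs with three levels and eleven RIFs with two levels. The order of the entries in the list is an arbitrary placeholder, since the assignment of level counts to individual RIFs is not listed here.

```python
from math import prod

# Number of levels per RIF: nine RIFs with three levels and eleven with two.
# The ordering is a placeholder and does not identify individual RIFs.
levels_per_rif = [3] * 9 + [2] * 11

# Total number of distinct recovery contexts (3^9 x 2^11).
n_contexts = prod(levels_per_rif)

# Probability of each context if all contexts are taken as equally likely.
equal_probability = 1 / n_contexts
```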

Table 8-5 presents an example of a potential recovery context as a 20-digit array

where each digit corresponds by its position to a particular RIF and by its value to the

precise impact of a particular RIF on controller performance. At this stage, all RIFs are

considered independently and their corresponding levels of influence on controller

performance take integer values, i.e. 1, 2, or 3.

Table 8-5 Example of a potential recovery context represented as a 20-digit array

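RIF ID | RIF1 | RIF2 | RIF3 | RIF4 | RIF5 | RIF6 | RIF7 | RIF8 | RIF9 | RIF10
Level | 1 | 1 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1

RIF ID | RIF11 | RIF12 | RIF13 | RIF14 | RIF15 | RIF16 | RIF17 | RIF18 | RIF19 | RIF20
Level | 2 | 2 | 1 | 1 | 3 | 3 | 3 | 1 | 3 | 3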
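
The example context of Table 8-5 can be held as a simple 20-element tuple, as sketched below. The helper names are hypothetical, and the check assumes a permissive maximum of three levels for every RIF because the split between two-level and three-level RIFs is not listed here.

```python
# The example context from Table 8-5, RIF1 to RIF20 in order.
example_context = (1, 1, 2, 1, 1, 2, 1, 2, 1, 1,
                   2, 2, 1, 1, 3, 3, 3, 1, 3, 3)

def is_valid_context(context, max_levels):
    """Check that each level is an integer between 1 and that RIF's maximum."""
    return (len(context) == len(max_levels)
            and all(isinstance(lv, int) and 1 <= lv <= mx
                    for lv, mx in zip(context, max_levels)))

def rif_level(context, rif_id):
    """Level of RIF<rif_id>, using the 1-based numbering of the text."""
    return context[rif_id - 1]

# Hypothetical assumption: allow up to three levels for every RIF.
permissive_max = [3] * 20
```

Storing the context positionally mirrors the 20-digit array of the table: the position identifies the RIF and the value gives its level.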

The following sections show how the existing RIF interactions may change the RIF

levels in either direction (i.e. increase the value of the level which corresponds to the

deterioration in controller performance or decrease the value of the level which

corresponds to an improvement in controller performance).

8.4 Interactions between Recovery Influencing Factors (Step 3)

The methodology for the assessment of the recovery context surrounding the

equipment failure occurrence presented in this Chapter is based upon 20 relevant

contextual factors or RIFs. In order to provide an accurate approach, this methodology

has to take into account all the interactions between these contextual factors. The

interactions have been initially established based upon operational experience and

validated by findings from HRA techniques and ATM specialists. The selection of all

relevant RIFs and establishment of their interactions creates a basis for the generation

of all possible recovery contexts and the calculation of the numerical indicator for each

context (Ic). The steps taken to identify RIF interactions are presented in the following

sections.

8.4.1 Identification of RIF interactions

At first glance, the identified RIFs reveal possible interactions between them. For

example, a poorly designed display (i.e. HMI) as well as inadequate knowledge of ATC

system modes (i.e. inadequate training) may lead to delayed failure detection and less

efficient recovery. Furthermore, stress as a personal factor cannot be independent of



traffic and airspace complexity. If a controller deals with increased levels of traffic, it is

reasonable to assume that stress levels will be higher.

In order to determine the effect of contextual factors on controller performance it is

therefore necessary to describe these interactions, in addition to describing how they

affect controller performance. The analysis of interactions makes it possible to gain a

more accurate picture of the context and thus a better understanding of the recovery

process. In other words, this permits a broader retrospective analysis as well as a more

precise prediction of the effectiveness of the improvement measures. As noted by

Straeter (2000), such interactions could also point to additional factors previously

omitted, such as potential organisational shortcomings.

Straeter (2000) tackles this problem in CAHR by looking at the common appearance of

different factors (using available databases). The analysis is based on capturing the

observed interactions between reported contextual factors. The availability of a detailed

database is however a prerequisite to this approach. Hollnagel (1998) on the other

hand establishes these interactions in CREAM by considering each contextual

condition with respect to how it generally influences the others (there is no mention

whether expert judgement or operational expertise have been used). It is also

important to say that CREAM assumes reciprocal interaction between the contextual

conditions.

The interactions amongst the 20 predefined RIFs have been determined based on known

relationships from operational experience and are marked with the symbol ‘√’ in Table 8-6.

They represent the irreversible (i.e. one-directional) influence between two RIFs, that is, how RIFs in the first row

affect RIFs in the left-hand column. The reason for irreversible influence lies in the

characteristics of the air traffic environment where one factor may influence the other

one without any reverse effect. For example, complex traffic can influence controller

personal capabilities in terms of increased stress, anxiety, and workload; while the

opposite influence (impact of personal capabilities on traffic complexity in the sector) is

simply not logical.
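
One-directional influences of this kind can be represented as ordered pairs, as in the sketch below. Only a few interactions that are named in the text are included, so this is not the full interaction matrix, and the helper names are hypothetical.

```python
# Directed influences stored as (influencing RIF, influenced RIF) pairs.
# Only a few interactions named in the text are listed, not the full matrix.
influences = {
    (17, 4),   # traffic -> personal factors
    (8, 4),    # number of workstations/sectors affected -> personal factors
    (11, 4),   # duration of failure -> personal factors
    (19, 18),  # weather -> airspace characteristics
}

def influenced_by(target, influences):
    """RIF IDs that directly influence the given RIF (one matrix row)."""
    return sorted(src for src, dst in influences if dst == target)

def is_reciprocal(a, b, influences):
    """True only if both a -> b and b -> a are present."""
    return (a, b) in influences and (b, a) in influences
```

For example, `influenced_by(4, influences)` returns the RIFs acting on ‘personal factors’, while `is_reciprocal(17, 4, influences)` is false because personal factors are not taken to influence traffic.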



Table 8-6 Interactions matrix: (c) validation by CREAM, (h) validation by CAHR, (a) validation by ATM specialists; and (x) not validated interactions

Rows list the influenced RIF; columns 1 to 20 (direct influence) list the influencing RIFs, numbered and named as the rows. Marked interactions are given per row (column positions omitted).

RIF (row) | Marked interactions
1 Training for recovery from ATC equipment failures | √ (a), √ (c/a)
2 Previous experience with equip. failures | √ (a)
3 Experience with system perf. (reliance) | √ (a), √ (h/a), √ (h/a), √ (h/a)
4 Personal factors | √ (a), √ (a), √ (a), √ (a), √ (a), √ (x), √ (h/a), √ (h), √ (x), √ (h/a), √ (h/a), √ (h/a), √ (h/a), √ (a), √ (h/a), √ (h), √ (h/a), √ (h/a)
5 Comm. for recovery within a team of controllers | √ (c/a), √ (c/a), √ (c/a), √ (a), √ (a), √ (x), √ (h/a), √ (h), √ (x), √ (h), √ (h/a), √ (h/a), √ (h/a), √ (c/a), √ (a), √ (x), √ (a), √ (h/a)
6 Complexity of failure type | √ (a)
7 Time course of failure develop. | √ (a)
8 Number of workstations/sectors affected | √ (a), √ (a)
9 Time necessary to recover | √ (h/a), √ (h/a), √ (h/a), √ (c/h/a), √ (c/h/a), √ (a), √ (a), √ (c/a), √ (a), √ (c/h/a), √ (c/h/a), √ (c/h), √ (c/h/a), √ (h/a), √ (h/a), √ (h/a), √ (c/h/a)
10 Existence of recovery procedure | √ (c/a)
11 Duration of failure | √ (a), √ (a)
12 Adequacy of HMI | √ (a), √ (a), √ (a), √ (c/a)
13 Ambiguity of info in the working environment | √ (a), √ (a), √ (a), √ (a), √ (a), √ (c/a), √ (c/a)
14 Adequacy of alarms/alerts | √ (a), √ (a), √ (c/a)
15 Adequacy of alarms/alerts onset | √ (a), √ (a), √ (a), √ (a), √ (c/a)
16 Adequacy of org. | √ (a), √ (a), √ (a), √ (a)
17 Traffic | √ (a), √ (a), √ (a)
18 Airspace char. | √ (a), √ (a), √ (x)
19 Weather | (none)
20 Task complexity | √ (h/a), √ (h/a), √ (h/a), √ (a), √ (a), √ (a), √ (h/a), √ (c/h/a), √ (a), √ (c/h/a), √ (c/h/a), √ (c/a), √ (c/h/a), √ (a), √ (a), √ (a), √ (a)



8.4.2 Validation of RIF interactions

This section validates the interactions identified in the previous section. This was

carried out in two stages. The first stage (sections 8.4.2.1 and 8.4.2.2) addresses

interactions identified in existing literature (CREAM and CAHR techniques). Although

Chapter 7 presented the basic principles behind these two techniques and extracted

candidate RIFs, this Chapter focuses only on the assessment of the interactions

between contextual factors identified in both techniques. The second stage (section

8.4.2.3) identifies the interactions based on the input by three ATM specialists. The

self-completion method was used to collect their responses.

8.4.2.1 CREAM

A comparison of the interactions between contextual factors defined in the CREAM

technique (i.e. CPCs) and those defined between RIFs (Table 8-6) shows a degree of

mapping. A direct link was found with all interactions except those relevant to ‘working

conditions’ and ‘number of simultaneous goals’ CPCs. As already explained in Chapter

7, these two contextual factors are excluded from the list of RIFs. Note that the

interactions relevant to the ‘crew collaboration quality’ CPC are compared with those

related to the ‘communication for recovery’ RIF, because teamwork after the detection of

an equipment failure is mostly verbal.

The CREAM technique is developed as a generic technique for the analysis of human

actions. Therefore, it is not specifically ATC oriented and cannot entirely reflect the

characteristics of the ATC environment. For this reason, several RIFs could not be

mapped to the CPCs. These are personal factors (except ‘time of the day’ as one of the

contextual factors identified in CREAM), complexity of failure type, time course of

failure development, number of workstations/sectors affected, duration of failure, traffic

complexity, airspace characteristics, and weather conditions. Overall, 22 percent of the

interactions identified amongst the RIFs are reflected in CREAM.

The mapping between CPC interactions in CREAM and RIF interactions is marked

with the symbol ‘c’ in Table 8-6.

8.4.2.2 CAHR

A comparison of the interactions between the six Man-Machine System (MMS) components and their

corresponding PSFs defined in CAHR and those defined between RIFs (Table 8-6)

shows a degree of mapping. This mapping is presented in Table 8-7.



Table 8-7 Mapping between RIFs and CAHR contextual factors

RIF | MMS
Personal factors | Person
Complexity of failure type | Task
Number of workstations affected | System
Duration of failure | Task
Time necessary to recover | Task
Time course of failure development | System
 | Order-issue
Adequacy of HMI | Feedback
Adequacy of alarms/alerts |
Airspace-related factors | Task/activity

Several identified PSFs are relevant to the nuclear plants (e.g. task preparation,

precision, labelling, marking), whilst the majority are applicable to recovery from

equipment failures in ATC (e.g. time pressure, procedures, HMI). Straeter (2000)

presents reciprocal interactions between PSFs in CAHR as captured through the

analysis of the common appearance of different factors in individual events from

nuclear databases. These interactions are marked with ‘h’ in Table 8-6.

In total, 35 percent of the RIF interactions are captured by CAHR.

8.4.2.3 Validation by ATM specialists

Various interactions between failure characteristics, airspace, traffic, personal factors,

ambiguity of information in the working environment, and the time necessary to recover

have not been confirmed through the preceding validation processes. However, the

existence of links between these factors has been validated independently by three

ATM specialists.

These ATM specialists come from the same ATC Centre and have more than ten years

of operational experience in the ATC domain. ATM specialists reviewed existing

interactions and marked those with which they disagreed. Their input was taken

through a small-scale self-completion survey based on the interactions identified in

Table 8-6 and marked with ‘√’. The exact form used in this small-scale survey is

presented in Appendix XI. The comparison of their independent validations showed

similarities. Several inconsistencies were identified, mostly due to the ATM specialists

initially misreading the matrix. These were clarified via personal correspondence

before the final validation. As a result, 90 percent of the RIF interactions from Table 8-6

have been validated by the ATM specialists (marked with ‘a’ in Table 8-6).



8.4.2.4 Validation summary

Of the RIF interactions, 95 percent (107 out of 113) have been validated by

existing literature and ATM specialists. The remaining six interactions were not

validated by either of the sources available. These, marked with ‘x’ in Table 8-6, are:

• impact of ‘number of workstations/sectors affected’ on ‘personal factors’;

• impact of ‘duration of failure’ on ‘personal factors’;

• impact of ‘number of workstations/sectors affected’ on ‘communication for

recovery’;

• impact of ‘duration of failure’ on ‘communication for recovery’;

• impact of ‘airspace characteristics’ on ‘communication for recovery’; and

• impact of ‘weather’ on ‘airspace characteristics’.

From the perspective of past research and ATM experts’ input, these six interactions do

not exhibit any correlation and thus the research presented in this thesis excludes

them from the remaining analysis. However, a more quantitative approach would be

required in future. For example, further development of the HERA database could allow

additional validation of RIF interactions (including these six). Furthermore, it could allow

the quantification of their level of influence through the definition of the coefficient of

interaction. Details on the coefficient of interaction are presented in the next section.

8.4.3 Quantification of RIFs interactions

The validated RIFs interactions above were used to develop a method to quantify the

level of interactions. The most accurate approach would be to analyse each interaction

separately as presented in equation 8-2:

$$RIFY_{j'} = RIFY_j + \sum_{x} k_{xy} \times R_x \qquad \text{(8-2)}$$

where,

RIFYj represents a level j of RIFY; j =1, 2, or 3;

RIFYj’ represents a level j’ of RIFY after incorporation of RIF interactions, 0.0 ≤ j’ ≤ 4.0;

kxy represents the coefficient of interaction between RIFX and RIFY (kxy≠kyx);

Rx depends upon the level of RIFX → Rx={+1, 0, -1}

In other words, kxy is the numerical representation of the direct influence that RIFX has

on RIFY. Note that the coefficient of interaction is directional rather than symmetric (i.e. kxy ≠ kyx).

Taking into account the overall lack of quantitative assessment of context in the area of



ATC, it is difficult to determine each coefficient kxy separately. As already discussed in

section 8.1.2, some initial attempts to establish a detailed database that captures the

human performance data are planned by EUROCONTROL through the Human Error in

ATM (HERA) project (EUROCONTROL, 2002d). Although the interactions do not

necessarily have the same level of influence, this thesis had to define a more generic

approach to account for lack of operational data. Nevertheless, if the RIFs interactions

become quantifiable (e.g. via HERA database), the methodology presented in this

Chapter will still be valid.

As a result, this thesis follows the assumption that all determined interactions have the

same level of influence, referred to as k. Namely, it is assumed that interactions

between all pairs of RIFs are equal and that there is therefore only one coefficient,

k=1/(N-1). N represents the total number of relevant RIFs for a particular ATC Centre or a

particular incident under investigation. In addition, (N-1) is used because one factor

cannot influence itself. Therefore, in the case of 20 relevant factors, the coefficient of

interaction would be calculated as k=1/19=0.053.

One important assumption made here is that all RIFs which influence a particular RIF

can never change its level by more than one unit, e.g. from Level 3 to Level 2 but not

from Level 3 to Level 1. The reason for this is that it takes more than 50 percent of

relevant RIFs to influence one particular RIF in exactly the same manner in order to

change its level (either enhancing or worsening it). For example, in the generic

approach where all 20 RIFs are relevant, it will take at least 11 RIFs, all defined via

Level 1, to influence one particular RIF in order to enhance its level by one unit, either

from Level 3 to Level 2 or from Level 2 to Level 1. This concept is similar to the

approach presented in CREAM (Hollnagel, 1998).
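
The one-unit limit can be checked directly from equation 8-2 under the single-coefficient assumption. The following is a brief sketch of that arithmetic, assuming that at most N-1 RIFs interact with a given RIF and that each contributes Rx of -1, 0, or +1:

$$\left| RIFY_{j'} - RIFY_j \right| = \left| \sum_{x} k \, R_x \right| \le (N-1) \times k = (N-1) \times \frac{1}{N-1} = 1$$

Since the initial level j is 1, 2, or 3, this bound is consistent with the stated range 0.0 ≤ j’ ≤ 4.0 for the adjusted level.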

As a consequence of incorporating RIF interactions, the RIF levels change. Table 8-8

presents the change in the RIF levels from the initial integer values (i.e. 1, 2, or 3)

presented in Table 8-5. If the level of any RIF decreases as a number this means that

other RIFs impacted this particular RIF in such a way that the change enhances

controller performance (see RIF20 in Tables 8-5 and 8-8 which decreased from the

initial value of 3 to a new value of 2.74). Similarly, if the RIF level increases as a

number, this means that other RIFs impacted this particular RIF in such a way that the

change degrades controller performance (see RIF18 which increased from the initial

value of 1 to a new value of 1.11). It is important to note that the probability of the



occurrence of any context, with or without incorporation of RIF interactions, is the same

(1/40,310,784=2.4E-08 as previously reported in section 8.3.2).

Table 8-8 Recovery context (as presented in Table 8-5) after the incorporation of RIF interactions


RIF     1     2     3     4     5     6     7     8     9     10
Level   1.00  0.95  1.95  0.84  0.89  2.05  1.05  2.05  0.74  1.05

RIF     11    12    13    14    15    16    17    18    19    20
Level   1.95  2.00  0.89  1.05  2.95  2.89  2.95  1.11  3.00  2.74

In short, a change (increase or decrease) in the value of a particular RIF represents the

final outcome of all possible interactions with that particular RIF. For example, RIF5

level changes from value 1 to value 0.89 as a result of the influence of 15 different

RIFs, as seen from the matrix in Table 8-6 (see row 5).

In this particular example, RIF1, RIF2, RIF4, RIF9, RIF10, RIF13, and RIF14 influence

RIF5 in a positive way as they are defined via Level 1. As a result, each of these seven

RIFs changes the RIF5 level by -1/19=-0.053. However, RIF15, RIF16, RIF17, RIF19,

and RIF20 influence RIF5 in a negative way as they are defined via Level 3. As a result,

each of these five RIFs increases the RIF5 level by +0.053. Other RIFs, namely RIF3,

RIF6, and RIF12 do not have any influence on RIF5 as their level is 2, which assumes

no significant influence on human performance. Furthermore, RIF7, RIF8, RIF11, and

RIF18 have no impact on RIF5 and therefore are not considered. The result of this is

an overall decrease in RIF5 level as follows (equation 8-3):

$$RIF5_{j'} = RIF5_j + 7 \times (-k) + 5 \times k = 1 + 2 \times (-0.053) = 1 - 0.106 = 0.894 \qquad \text{(8-3)}$$
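
The worked example above can be sketched in C as follows. This is a minimal illustration rather than the code used in the thesis: the function names and the array encoding of one row of the interaction matrix are hypothetical, and the sign convention (Level 1 contributes -1, Level 2 contributes 0, Level 3 contributes +1) is inferred from the RIF5 example.

```c
#include <stddef.h>

/* Contribution R_x of an influencing RIF, as implied by the worked example:
 * Level 1 lowers the target level (-1), Level 2 is neutral (0),
 * Level 3 raises it (+1). */
static int influence_sign(int level)
{
    if (level == 1) return -1;
    if (level == 3) return +1;
    return 0;
}

/* Adjusted level of one RIF under the single-coefficient assumption
 * k = 1/(N-1).
 *   levels[]      initial integer level (1, 2 or 3) of each of the n RIFs
 *   influences[]  non-zero where RIF x influences the target
 *                 (hypothetical encoding of one interaction-matrix row)
 *   target        zero-based index of the RIF being adjusted */
double adjusted_level(const int *levels, const int *influences,
                      size_t n, size_t target)
{
    double k = 1.0 / (double)(n - 1);
    int net = 0;

    for (size_t x = 0; x < n; ++x) {
        if (x == target || !influences[x])
            continue;               /* a RIF cannot influence itself */
        net += influence_sign(levels[x]);
    }
    return (double)levels[target] + k * (double)net;
}
```

With the 20 levels of the example context and the 15 influences on RIF5 encoded in `influences`, the net count is 5 - 7 = -2 and the function returns 1 - 2/19, in agreement with equation 8-3.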

The incorporation of all identified RIF interactions applied to all the identified recovery

contexts (all 40,310,784 of them) made it possible to identify the distribution of all RIFs.

Prior to incorporation of RIF interactions, the distribution of each level is the same. For

example, Figure 8-2 represents the distribution of RIF5 without incorporation of RIF

interactions. This graph represents three levels of RIF5 in a symmetrical manner, each

accounting for exactly 13,436,928 contexts or one third of the total (Figure 8-2). This

results in equal representation of each level in the 40,310,784 possible recovery

contexts.
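
The total of 40,310,784 contexts can be reproduced as the product of the number of levels available to each RIF. The sketch below assumes that nine RIFs have three levels and eleven have two; this split is inferred from the footnotes on missing levels and from the single cut-off points in Table 8-11, not stated explicitly in the text, and the function name is hypothetical.

```c
#include <stddef.h>

/* Number of recovery contexts: the product of the number of levels
 * that each RIF can take. */
unsigned long long count_contexts(const int *levels_per_rif, size_t n)
{
    unsigned long long total = 1ULL;

    for (size_t i = 0; i < n; ++i)
        total *= (unsigned long long)levels_per_rif[i];
    return total;
}
```

For a three-level RIF such as RIF5, each level then accounts for one third of the total, which matches the 13,436,928 contexts quoted above.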


[Figure 8-2: histogram of Frequency (y-axis) against Level (x-axis)]

Figure 8-2 Distribution of RIF5 levels amongst identified recovery contexts without interactions

However, due to the identified interactions, the distribution of RIF5 levels amongst all

possible recovery contexts takes a different, more dispersed, shape (Figure 8-3). It is

notable that the more interactions exist with a particular RIF, the more dispersed the

distribution of levels will be. The example utilised in this section (i.e. RIF5) has a

substantial number of other contextual factors that affect it, namely 15. However, in

some cases the number of identified interactions can be small (e.g. one or two) while in

the case of RIF19 (weather conditions) there are no identified interactions and thus this

RIF has a distribution similar to that of RIF5 without interactions (Figure 8-2). In any case, the total number of

recovery contexts where RIF5 (or any other RIF) is defined via Level 1 remains the

same whether RIF interactions are incorporated or not. The distribution of the levels for

each of the 20 RIFs is presented in Appendix XII in a tabular format.

[Figure 8-3: histogram of Frequency (y-axis) against Level (x-axis)]

Figure 8-3 Distribution of RIF5 levels amongst identified recovery contexts with interactions

Once the RIF interactions have been identified and their impact quantitatively

determined, the next step is to re-calculate existing RIF probabilities to more accurately

reflect newly determined RIF levels. However, to achieve this step it is necessary to



determine the cut-off points between any two consecutive levels of influence, i.e. to

determine the precise boundaries between Level 1, Level 2, and Level 3. Another

option would be to consider each of the distributions separately, i.e. covering the entire

spectrum (-∞, +∞). In this way, there is no cut-off point and there is coherency between

all results as well. However, both approaches yield similar results as there is very little

overlap between these distributions. The following section explains the method applied

to determine the cut-off points between any two consecutive RIF levels.

8.5 Methodology for the determination of the cut-off points (Step 4)

As a result of the differences between the interactions affecting different RIFs (see Table 8-6),

as previously highlighted, the cut-off points between levels will vary from one

RIF to the other. The shape and dispersion of the distribution of levels for each RIF

depends upon the number and type of interactions with other RIFs. As an example,

observe the difference in the distribution of levels for RIF1 (Figure 8-4) and RIF20

(Figure 8-5), where RIF1 is impacted by two different RIFs while RIF20 is

impacted by 17 different RIFs.

[Figure 8-4: distribution of levels for RIF1; Frequency (y-axis) against Level (x-axis)]

[Figure 8-5: distribution of levels for RIF20; Frequency (y-axis) against Level (x-axis)]

The statistical method for determining the cut-off points between the levels for each

RIF is based on the 95 percent confidence interval for each level. For example, a 95

percent confidence interval for Level 1 of RIF1 would cover 95 percent of the normal

curve, where the probability of observing a value of Level 1 RIF1 outside of this area

would be less than 0.05. Under the assumption of a normal distribution (see footnote 7), the interval

range (µ - 2σ, µ + 2σ) captures approximately 95 percent of data.

The advantage of this approach is that it relies on a standard statistical technique. In

addition, this method relies upon known values of µ and σ in order to define the interval

range for each level. In other words, to calculate the values of µ and σ for RIF1 Level 1,

it is necessary to already have an assumption about the sample size (depicted as N in

equation 8-4).

$$\mu = \frac{\sum_{n=1}^{N} X_n}{N}, \qquad \sigma = \sqrt{\frac{\sum_{n=1}^{N} (X_n - \mu)^2}{N}} \qquad \text{(8-4)}$$

where

µ represents population mean for RIF1 Level 1 (population of all possible recovery

contexts where RIF1 is defined through Level 1);

σ represents population standard deviation for RIF1 Level 1;

N represents the total number of recovery contexts in which RIF1 is defined via Level 1;

Xn represents the n-th value of the variable RIF1 Level 1 (n=1, 2, …, N).

To overcome this, three different interval values or three different cut-off points

(assumed based upon the initial distribution of data) are tested. For example, when

assessing the cut-off points between levels of RIF5, three different values between

Level 1 and Level 2 have been tested (namely Fit 1, Fit 2, and Fit 3 in Figure 8-6).

7 Corresponds to the symmetrical distribution of levels around the values of 1, 2 and 3, but also

to the large number of observations.



Figure 8-6 Distribution fitting for the three cut-off points on the example of RIF5 Level 1 (see footnote 8)

The normal distribution parameters, as presented in Table 8-9, show no difference

between the distributions of RIF5 Level 1 data when the first and second cut-off points are

applied. However, the use of the third cut-off point yields a different distribution. This

is expected as the third cut-off incorporates data which shows increased frequency for

the value of 1.8 (see Figure 8-7 and Table 8-9). Based on this, Fit 1 and Fit 2,

corresponding to cut-off points 1.6 and 1.7 respectively, are taken forward. However, it

is necessary to determine which of these two values will be taken as the final cut-off point.

Table 8-9 Descriptive statistics for the three cut-off points on the example of RIF5 Level 1

RIF5 Level 1   Cut-off point used   Mean   Standard deviation   Standard error on the mean
Fit 1          1.6                  1.18   0.17                 4.59E-05
Fit 2          1.7                  1.18   0.17                 4.65E-05
Fit 3          1.8                  1.19   0.19                 5.11E-05

In order to precisely determine the optimal cut-off point, it is necessary to apply a

polynomial function to the data between the mean values for Level 1 and Level 2 and

determine the minimum of that function. The polynomial function minimum rounded to

the first decimal should indicate the cut-off point (either 1.6 or 1.7). Table 8-10 presents

three different polynomial functions applied to distribution of RIF5 Level 1 and Level 2

8 Probability density function approach represents distributions so that the sum of the areas of

the rectangles equals 1.



data. The calculation of the function minimum (see footnote 9) shows that regardless of the type of

polynomial function, the local minimum corresponds to the cut-off point at 1.7 (Table 8-10).

The fit of a cubic polynomial function to RIF5 Level 1 data is presented in Figure 8-7.

Since Table 8-9 shows that the choice between cut-off points 1.6 and 1.7 makes no

significant difference, and since the function minimum is closer to the value of 1.7, this

value is taken forward as the cut-off point between RIF5 Level 1 and Level 2.

Table 8-10 Local minimums of polynomial functions

Polynomial function   f(x)                                                           Local minimum
Quadratic             1E07(1.3472x^2 - 4.5848x + 3.9200)                             1.7016
Cubic                 1E07(-0.5613x^3 + 4.2097x^2 - 9.3510x + 6.5076)                1.6653
Quartic               1E08(-0.1785x^4 + 1.1574x^3 - 2.6289x^2 + 2.4203x - 0.7121)    1.6756

[Figure 8-7: Frequency (y-axis) against Level (x-axis, 1.2 to 2.2) with the fitted curve f(x) = 1E07(-0.5613x^3 + 4.2097x^2 - 9.3510x + 6.5076)]

Figure 8-7 Cubic polynomial function f(x) fitted for the RIF5 data to determine its minimum

Similarly, the value of 2.7 is taken as a cut-off point between Level 2 and Level 3 (see

Table 8-11). Using the same methodology, the cut-off points are determined for all

RIFs and their corresponding levels. The established values are reported in Table 8-11.

Table 8-11 Cut-off points between the levels for all RIFs

RIF ID   Cut-off point between Level 1 and Level 2   Cut-off point between Level 2 and Level 3
1        1.5                                         2.5
2        1.5                                         N/A
3        N/A                                         2.5
4        1.7                                         2.7
5        1.7                                         2.7
6        N/A                                         2.5
7        1.5                                         2.5
8        N/A                                         2.5
9        2.2 (single cut-off point)
10       1.5                                         2.5
11       N/A                                         2.5
12       1.5                                         2.5
13       2.0 (single cut-off point)
14       1.5                                         2.5
15       2.0 (single cut-off point)
16       1.5                                         2.5
17       N/A                                         2.5
18       1.6                                         2.6
19       N/A                                         2.5
20       N/A                                         2.7

9 In the case of quartic polynomial functions, it is necessary to specify the local minimum (the

first derivative of this polynomial function can have three roots and thus the function potentially has two minimums).

8.6 Specific effects of RIFs on controller recovery performance (Step 5)

While the previous section identified the cut-off points between consecutive levels of

each RIF, it is necessary to quantify the relationship between the particular level of a

RIF and its impact on controller recovery performance. This relationship has

already been defined qualitatively in Chapter 7 through the definition of the qualitative

descriptor. In short, Level 1 corresponds to the most desirable level, Level 2 to the

tolerable or average level, whilst Level 3 corresponds to the least desirable level in the

context of controller recovery performance.

In order to begin to look at the quantitative impact of each RIF level on the controller

recovery performance, the correlation coefficient is proposed. This correlation

coefficient is defined as: +1.00 corresponding to Level 1 (high positive relationship),

0.00 corresponding to Level 2 (no relationship), and -1.00 corresponding to Level 3

(high negative relationship). This approach is in line with the approach presented in

Oren and Ghasem-Aghaee (2003) who also introduced a correlation coefficient as an

indicator of the relationship between the factors that define a personality (e.g.

openness, extroversion) and different personality types.

Once the relevant RIFs and their corresponding levels have been defined and linked to

the controller recovery performance, the next step is to present the recovery context as

a function of all contextual factors, their interactions, and impact on controller recovery

performance. The following section presents the definition of the recovery context via

the recovery context indicator.



8.7 Calculation of the recovery context indicator (Step 6)

Based on the determination of the boundaries between consecutive levels for each RIF,

it is possible to proceed with the re-calculation of RIF probabilities and the

determination of the numerical indicator of each recovery context (i.e. recovery context

indicator - Ic). These are presented in the following sections.

8.7.1 Re-calculation of RIF probabilities

The main task at this stage is to re-calculate the probabilities that correspond to more

realistic (effective) levels resulting from the incorporation of all RIF interactions. The

previous example of one randomly chosen recovery context showed that RIF5 changed

from Level 1 (Table 8-5) to a new effective level (0.89; Table 8-8). Therefore, if the

probability of RIF5 at Level 1 is 0.73 (see Table 8-12), then it is necessary to determine

the probability of the new, effective level 0.89.

Table 8-12 Probabilities for the RIF5 and each of its levels (see Appendix X)

RIF5: Communication for recovery within team/ATC Centre

              Level   p(L)
Efficient     1       0.73
Tolerable     2       0.24
Inefficient   3       0.04

The way to approach this problem is firstly to determine all recovery contexts for which

RIF5 is represented via Level 1. In other words, it is necessary to determine the

number of recovery contexts for which the RIF5 level is smaller than or equal to the cut-off

point between Levels 1 and 2 (i.e. 1.7, Table 8-11). This is presented in equation 8-5

below:

$$RIFX_j = \sum_{j'=C_{j-1,j}}^{C_{j,j+1}} RIFX_{j'} =
\begin{cases}
RIFX_1 = \sum_{j'=0}^{C_{1,2}} RIFX_{j'}, & j=1,\ 0 < j' \le C_{1,2} \\
RIFX_2 = \sum_{j'=C_{1,2}}^{C_{2,3}} RIFX_{j'}, & j=2,\ C_{1,2} < j' \le C_{2,3} \\
RIFX_3 = \sum_{j'=C_{2,3}}^{4} RIFX_{j'}, & j=3,\ C_{2,3} < j' \le 4.0
\end{cases} \qquad \text{(8-5)}$$

where

X represents different contextual factors, X= 1,2,3…,20;

j represents a level of RIFX and can take the values of 1, 2 or 3;

j’ represents a level of RIFX after incorporation of interactions where 0.0 ≤j’≤4.0;

Cj,j+1 represents the cut-off point between Levels j and j+1.

For example, for RIF5 (Table 8-11):

$$\begin{array}{lll}
C_{j,j+1} = C_{1,2} = 1.7, & j=1, & 0 < j' \le 1.7 \\
C_{j,j+1} = C_{2,3} = 2.7, & j=2, & 1.7 < j' \le 2.7 \\
\text{N/A}, & j=3, & 2.7 < j' < 4.0
\end{array}$$
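
The assignment of an adjusted level to Level 1, 2, or 3 by means of the cut-off points can be sketched as below. This is a minimal illustration for a RIF with three levels; the function name is hypothetical, and RIFs with a single cut-off point would need a separate case.

```c
/* Map an adjusted level j' to Level 1, 2 or 3 for a three-level RIF.
 *   c12  cut-off point between Level 1 and Level 2
 *   c23  cut-off point between Level 2 and Level 3
 * The upper bound of each interval is inclusive, as in 0 < j' <= C12. */
int effective_level(double j_adjusted, double c12, double c23)
{
    if (j_adjusted <= c12)
        return 1;
    if (j_adjusted <= c23)
        return 2;
    return 3;
}
```

For RIF5, with cut-off points 1.7 and 2.7, the adjusted level 0.89 from the example context remains in Level 1.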

Secondly, it is necessary to determine the subset of recovery contexts which corresponds

to the newly determined level (i.e. 0.89). These are all recovery contexts having RIF5

level in the range (0.8, 0.9]. It should be noted that level 0.89 represents the value of

RIF5 level for one specific recovery context. Finally, the probability of the new level is

calculated as follows (equation 8-6):

$$p(RIFX_{j'}) = p(RIFX_j) \times \frac{f(RIFX_{j'})}{f(RIFX_j)}$$

$$p(RIF5_{0.89}) = p(RIF5_1) \times \frac{f(RIF5_{0.89})}{f(RIF5_1)} = 0.73 \times \frac{1{,}008{,}576}{13{,}476{,}924} = 0.055 \qquad \text{(8-6)}$$

where

X represents different contextual factors, X= 1,2,3…,20;

j represents levels 1, 2, or 3;

f represents the sum of all possible recovery contexts;

p (RIF5 j) represents initial probability of occurrence of RIF5 for level j;

p (RIF5 j’) represents probability of occurrence of RIF5 for its new level j’;

f (RIF5 j’) represents the sum of levels for 0.89 < j’ ≤ 0.90; and

f (RIF5 j) represents the sum of all levels that correspond to the RIF5 Level 1

(i.e. 0.0 < j’ ≤ 1.7).
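
The re-calculation in equation 8-6 amounts to scaling the initial level probability by the share of contexts that fall into the bin of the adjusted level. A minimal sketch, with a hypothetical function name and the frequencies passed as floating-point counts:

```c
/* Probability of an adjusted level j' (equation 8-6).
 *   p_level     initial probability of the integer level j
 *   f_adjusted  number of contexts whose adjusted level falls in the
 *               bin of j'
 *   f_level     number of contexts whose adjusted level is assigned
 *               to level j by the cut-off points
 * Returns 0 when f_level is not positive, to avoid division by zero. */
double recalculated_probability(double p_level, double f_adjusted,
                                double f_level)
{
    if (f_level <= 0.0)
        return 0.0;
    return p_level * f_adjusted / f_level;
}
```

With the RIF5 values quoted above (0.73, 1,008,576 and 13,476,924) the result rounds to 0.055.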

The new probability of occurrence (0.055) is low in magnitude, but represents an

occurrence with a high probability of recovery. In other words, in this particular

context, RIF5 is enhanced by the influence of all the other RIFs that have interaction

with it. The final output of this methodology is the indicator of a specific recovery

context (Ic), as presented in equation 8-7. The characteristics of Ic are that, for

example, in the case of all 20 RIFs defined via Level 1 with the probability 1 and no



interactions, the value of Ic equals 1. Similarly, in the case of all 20 RIFs defined via

Level 3 with the probability 1 and no interactions, the value of Ic equals -1.

$$I_c = \frac{\left[ \sum_{i=1}^{20} \sum_{j=1}^{3} p(RIFX_{j'}) \times R_j \right]_{\text{RIFs with 3 levels}} + \left[ \sum_{i=1}^{20} \sum_{j=1}^{2} p(RIFX_{j'}) \times R_j \right]_{\text{RIFs with 2 levels}}}{N} \qquad \text{(8-7)}$$

where

p(RIFX j’) represents the probability of RIFX with level j’, where X=1, 2, 3, …, 20 and 0.0 ≤ j’ ≤ 4.0. The level j’ takes into account all interactions between RIFs;

Rj represents the correlation coefficient between RIFX and controller recovery performance. Depending upon level j’, it can take values {-1, 0, +1};

N represents the total number of recovery factors (i.e. 40,310,784); and

p(RIFX j’) x Rj represents the probability of the overall situation occurring in one ATC Centre. In order to look at the quantitative impact that each RIF has on the controller recovery performance, each of the probabilities has to be multiplied with the correlation coefficient.

All calculations relevant to the quantitative assessment of the recovery context

conducted in this thesis are performed using the standard C programming language.

8.7.2 Distribution of the recovery context indicator

The recovery context indicator (Ic) is the numerical representation of a specific

context that surrounds controller recovery from an ATC equipment failure. For

example, changes in the factors that constitute the recovery context (i.e. 20 RIFs),

captured via the change of their qualitative levels, interactions, and effect on controller

performance, are reflected in the change of the Ic magnitude. In practical terms, this

change facilitates better or worse controller recovery.

After the calculation of all 40,310,784 possible contexts it was determined that the

mean value of recovery context indicator (Ic) is 0.027, ranging between -0.069 and

0.131. The distribution of the Ic variable is presented in Figure 8-8.
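
The calculation of Ic for a single recovery context (equation 8-7) can be sketched as follows. This is a simplified illustration, not the thesis code: it assumes one term per RIF (the probability of its effective level in that context multiplied by the correlation coefficient of section 8.6), and it assumes that the divisor is the number of RIFs in the context, which is the reading under which all RIFs at Level 1 with probability 1 give Ic = 1. The function names are hypothetical.

```c
#include <stddef.h>

/* Correlation coefficient from section 8.6:
 * Level 1 -> +1, Level 2 -> 0, Level 3 -> -1. */
int correlation_coefficient(int level)
{
    if (level == 1) return +1;
    if (level == 3) return -1;
    return 0;
}

/* Recovery context indicator for one context.
 *   p[]      re-calculated probability of each RIF's effective level
 *   level[]  integer level (1, 2 or 3) to which that effective level
 *            is assigned by the cut-off points
 *   n        number of RIFs (assumed to be the divisor N) */
double recovery_context_indicator(const double *p, const int *level,
                                  size_t n)
{
    double sum = 0.0;

    if (n == 0)
        return 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += p[i] * (double)correlation_coefficient(level[i]);
    return sum / (double)n;
}
```

Under these assumptions the function reproduces the two limiting cases described for Ic: +1 when every RIF is at Level 1 with probability 1, and -1 when every RIF is at Level 3 with probability 1.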

[Figure 8-8: histogram of Frequency (y-axis) against Recovery context indicator (Ic) (x-axis)]

Figure 8-8 Distribution of the recovery context indicator

This distribution is slightly positively skewed (right-skewed) since it has a longer tail in

the positive direction relative to the other tail. This is also confirmed by the positive

value of the statistical test indicating the concentration of values on the left side of the

distribution. The median value or value on the horizontal axis which has exactly 50

percent of the data on each side is -0.023. This positive skew may result from initial

inputs into the methodology for the quantitative (probabilistic) assessment of the

recovery context surrounding equipment failure in ATC. For example, observing the

probability values for each RIF and its corresponding levels it is clear that 12 out of 20

RIFs have a higher probability of enhancing recovery performance as opposed to

having no impact or negative impact. In other words, the probabilities of Level 1 for

these 12 RIFs are higher than for other level(s) (i.e. Level 2 and Level 3, see Appendix

X for details on RIFs probabilities). Therefore, it can be concluded that the recovery

context indicator for the ‘generic’ ATC Centre takes a value close to 0.027. This indicates

that there is a large potential for improvement and for a shift of the Ic values towards the

positive side, thus enabling more appropriate contextual conditions.

In order to fully comprehend the characteristics of Ic, the next step is to calculate the

extreme values of Ic, from the most negative towards the most positive value of Ic. In

other words, it is necessary to determine the ‘ideal’ recovery context where all RIFs can


be expressed via Level 1 (see footnote 10). Similarly, it is necessary to determine the ‘worst’ possible

recovery context where all RIFs can be expressed via Level 3 (see footnote 11). In these cases, when

there is no uncertainty related to the probabilities of each RIF’s level, it is possible to

represent the most negative and the most positive recovery context.

Hence, the most negative value of Ic calculated using equations 8-6 and 8-7 takes the

value of -0.95. This value represents the worst possible recovery context that can

facilitate controller recovery performance in the ’generic’ ATC Centre. Similarly, the

most positive value of Ic calculated using the same equations is 0.65. These two

values are numerical representations of two extreme recovery contexts which are

mutually exclusive. However, these extreme values may be used as a good indicator of

the scale of changes that are possible to achieve within the ATC environment.

8.7.3 Sensitivity analysis

Because of the large number of recovery contexts (millions) it is reasonable to use the

assumption of normality in accordance with the central limit theorem (Berenson et al.,

2006). When the data set is large, the sampling distribution of the mean is

approximately normally distributed. Using this assumption, it is possible to carry out an

analysis of the sensitivity of Ic to changes in any one recovery influencing factor.

The first step is to determine an interval around the baseline (population) mean that

includes 95 percent of the sample means or µ±2σ. According to the statistics presented

in Table 8-13 this range is 0.027+/-0.058. The second step is to implement a particular

change and test whether the sampled recovery context indicator comes from the same

population. As an example, it is assumed that the ‘training for the recovery’ provided to

air traffic controllers includes the equipment failure in question. Therefore, since there

are no uncertainties, this RIF can be defined exactly via Level 1 and its corresponding

probability (p=1). Sample statistics are presented in Table 8-13.

10 RIF3, RIF6, RIF8, RIF11, RIF17, RIF19, and RIF20 do not have the possibility of Level 1 and thus these will take the next most desirable level, being Level 2.

11 RIF2 does not have the possibility of Level 3 and thus it will take the next most undesirable level, being Level 2.

Table 8-13 Sensitivity analysis

Step change                           Statistics (M, SD)                    Baseline mean range
Baseline                              N=40,310,784; M=0.027; SD=0.029       (-0.031, 0.085)
Sample 1 (change of RIF1)             N=13,436,928; M=0.061; SD=0.035
Sample 2 (change of RIF1 and RIF2)    N=6,718,464; M=0.091; SD=0.023

With suitable training for the situation in question (e.g. a particular failure type) there is

no significant difference between the sample and baseline means but it is observable

that the value of Ic shifts toward a more positive value. Therefore, a second sample

was taken, assuming additionally that RIF2 or ‘experience with equipment failure’

matches precisely the equipment failure in question. In other words, RIF2 can be

defined exactly via Level 1 and its corresponding probability (p=1). The result of this

analysis shows that there is a significant change in the recovery context, since the

obtained mean does not fit the 95 percent confidence interval determined for the

baseline. Therefore, the enhanced recovery context (sample 2) comes from a

population different from the baseline recovery context. This finding indicates that the

value of Ic is sensitive to changes in the individual RIFs.
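
The comparison used in this sensitivity analysis can be sketched as a simple range check. The function name is hypothetical; the check only tests whether a sample mean lies outside the baseline range µ ± 2σ, as described above.

```c
/* Returns 1 when the sample mean lies outside the baseline range
 * (base_mean - 2*base_sd, base_mean + 2*base_sd), 0 otherwise. */
int outside_baseline_range(double sample_mean, double base_mean,
                           double base_sd)
{
    double lower = base_mean - 2.0 * base_sd;
    double upper = base_mean + 2.0 * base_sd;

    return sample_mean < lower || sample_mean > upper;
}
```

With the baseline statistics of Table 8-13 (M=0.027, SD=0.029), the Sample 1 mean of 0.061 lies inside the range and the Sample 2 mean of 0.091 lies outside it.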

8.7.4 Optimal solutions

The methodology for the quantitative assessment of the recovery context presented in

the previous sections allows for the investigation of the recovery context in a particular

ATC Centre as well as for a particular equipment failure event. Furthermore, this

approach creates a basis for quantitative assessment and the choice of optimal

solutions for recovery enhancement. These solutions should be reviewed through the

changes in RIFs, their corresponding level, and the resulting changes in the value of Ic.

Whilst not all RIFs could be enhanced, it is necessary to focus on those which may be

affected. For instance, it is reasonable to assume that internal factors have a significant

potential for change either by enhancement of training or personal abilities on a daily

basis (e.g. fatigue, health, attitude, stress). A review of the other three RIF groups

(equipment related, external, and airspace related) reveals potential areas of change

as well as factors which cannot be influenced at the level of a particular ATC Centre

but possibly at the level of a region (e.g. traffic complexity can be influenced at the

regional ATM level through the central flow management unit).

The optimal change is defined as the best ratio between the benefit and the cost of the

proposed recommendations. Benefit is defined as a shift in the RIF levels toward more



desirable Level 2 (average) or Level 1 (most favourable) and an overall shift in the

recovery context indicator (Ic) towards more positive values (e.g. extreme positive

value). The cost should be defined through the inherent costs linked to the proposed

recommendation and therefore, should include actual rather than generic costs of the

proposed change within the specific ATC Centre. Thus the cost may include the

following:

� costs of technical changes, followed by any other operational costs (delay in the

use of new system due to necessary maintenance, staff training);

� costs of designing a new procedure, followed by the cost of training the staff (i.e.

time and resources);

� cost of additional Team Resource Management (TRM) training;

� creation of a more adequate organisational environment. The examples are

improvements in terms of roles and responsibilities, the availability of team

members, the adequacy of supervision, the availability of additional support (e.g.

assistant), the personnel selection process, shift patterns and personnel planning,

attitude to teamwork, safety culture, stress management programs, support for

the organised exchange of past experience on non-nominal events,

communication with management and technicians (e.g. briefings, exchange of

knowledge, bulletins, safety panels); and

• the costs of any potential changes in airspace design.
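The definition of the optimal change given above can be illustrated with a minimal sketch. It assumes that the benefit of a recommendation is summarised by the shift in Ic alone and that a single, centre-specific cost figure is available for each recommendation. The class, its field names, and all numerical values are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    """Hypothetical record of one proposed change in an ATC Centre."""
    name: str
    ic_baseline: float  # Ic before the change (illustrative value)
    ic_after: float     # Ic after the change (illustrative value)
    cost: float         # centre-specific cost in arbitrary units (assumption)

    @property
    def benefit(self) -> float:
        # Benefit taken as the shift of Ic towards more positive values.
        return self.ic_after - self.ic_baseline

    @property
    def benefit_cost_ratio(self) -> float:
        if self.cost <= 0:
            raise ValueError("cost must be positive")
        return self.benefit / self.cost


def rank_recommendations(recommendations):
    """Order recommendations from the best to the worst benefit/cost ratio."""
    return sorted(recommendations, key=lambda r: r.benefit_cost_ratio, reverse=True)


# Illustrative values only; they are not results from this thesis.
options = [
    Recommendation("additional TRM training", 0.10, 0.30, 4.0),
    Recommendation("new recovery procedure", 0.10, 0.50, 10.0),
    Recommendation("airspace redesign", 0.10, 0.60, 50.0),
]
best = rank_recommendations(options)[0]
```

In this sketch the change with the largest shift in Ic is not the optimal one, because its cost is disproportionately high.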

The methodology presented in this thesis is able to quantify the benefit of each

proposed solution. However, the evaluation of the related costs, as opposed to the

benefit, is not so straightforward and would necessitate input from ATC Centres.

Therefore, another approach may be utilised to ‘rate’ the benefit of implemented

changes on the level of ATC Centre, namely by the calculation of the ‘recovery context

efficiency’. This variable represents the ratio between the value of current recovery

context and the value of the most positive recovery context feasible in a particular ATC

Centre.
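The ratio described above may be sketched as follows. The function name is hypothetical, and the sketch assumes that the most positive feasible value of the recovery context is greater than zero, so that the ratio is well defined.

```python
def recovery_context_efficiency(ic_current: float, ic_best_feasible: float) -> float:
    """Ratio of the current recovery context value to the most positive
    value feasible in a particular ATC Centre.

    Assumption: ic_best_feasible is strictly positive.
    """
    if ic_best_feasible <= 0:
        raise ValueError("the most positive feasible value must be greater than zero")
    return ic_current / ic_best_feasible


# Illustrative values only: the Centre currently achieves half of its
# most positive feasible recovery context.
efficiency = recovery_context_efficiency(0.25, 0.5)
```

A value close to one would indicate that the ATC Centre already operates near its most favourable feasible recovery context.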

8.8 Summary

This Chapter has presented a methodology for the quantitative assessment of recovery

context. It started by reviewing the past HRA research of relevance to the quantitative

analysis of contextual factors. This has resulted in the selection of the CREAM

technique and its application by Kim, Seong, and Hollnagel (2005) for further

development. Building on this, a novel methodology has been developed for the

research presented in this thesis. This method assessed controller recovery


performance based on 20 relevant contextual factors (RIFs) and through several

distinct steps. Each RIF and its corresponding levels have been probabilistically

determined using four sources of information. These are operational failure reports,

questionnaire survey, input from eight ATM specialists, and past ATM related literature.

The methodology has further built on this and incorporated RIF interactions. This has

resulted in the change of the RIF levels and re-calculation of the corresponding

probabilities. The outcome of the entire methodology is the definition of the recovery

context indicator (Ic), as a numerical representation of a specific context surrounding

recovery from equipment failure in ATC. Ic is sensitive to the RIF changes and as such

may be used to investigate solutions to enhance the controller recovery. In other words,

the benefits of any safety-relevant changes in ATC Centres may be quantitatively

assessed in two separate ways. Firstly, the benefit can be assessed as a shift in the

distribution of the recovery context indicator from the baseline (pre-change) value to

the new value (as a result of implemented changes). Secondly, it is possible to

calculate the recovery context efficiency (context utilisation), i.e. the ratio between the current value of the recovery

context and its most positive value achievable within the particular ATC Centre.

After the review of the methodology for the quantitative assessment of recovery context

in a specific ATC environment, the following Chapter 9 describes an experimental

investigation designed to further verify the proposed methodology.

Chapter 9 Experimental Investigation

9 Experimental Investigation of the Air Traffic Controller Recovery Performance

After the review of the methodology for the quantitative assessment of the recovery

context in the previous Chapter, this Chapter describes an experiment designed to

further validate the proposed methodology and capture the controller recovery

performance. This Chapter begins with a high-level design for the process adopted for

the experiment. This is followed by the rationale behind the need for the experiment

defined through several objectives. In order to achieve these objectives, this Chapter

describes the overall design of the experiment and selection of potential equipment

failures initially tested in a pilot study. It continues by providing the key requirements for

the experiment of relevance to this thesis, measured variables, and experimental

procedure.

Both the pilot and the main experiment were conducted in close collaboration with one

European Civil Aviation Authority (CAA)1. This particular CAA provided all of the

necessary infrastructure and staff from two ATC Centres during the period of the

experiment in 2005 and 2006. One ATC Centre was used for the pilot study which

tested the feasibility of the experimental design and its overall methodology. The other

ATC Centre was used on three separate occasions to simulate a selected unexpected

equipment failure in order to capture data on the recovery performance of 30 licensed

air traffic controllers. The Chapter concludes with a discussion of measured variables

used to capture the characteristics of controller recovery in ATC. The data collected is

subjected to a rigorous analysis in Chapter 10.

1 This CAA performs the function of Air Navigation Service Provider (ANSP) and the term CAA

will also be used to denote the ANSP in the remainder of this thesis.


9.1 High-level design of the experimental process

Figure 9-1 below indicates the steps of organising and conducting this experiment. The

process started with the rationale behind the need for an experiment designed to capture

controller recovery performance. It proceeded with the assessment of available

resources, with a focus on two key requirements, namely access to an ATC simulator

and the participation of controllers. Once these requirements had been assured, the

experimental process proceeded with the initial planning and design of the experiment

(i.e. airspace and traffic scenario, equipment failure type). Once this design had been

tested in a pilot study, the experimental process proceeded with the main experimental

study. The collected data were pre-processed and subjected to a rigorous analysis to extract

information on controller recovery in an operational environment (presented in

Chapter 10).

[Flow diagram boxes: Rationale for the experiment; Assessment of the available resources; Planning for the experiment; Design of the experiment; Selection of the equipment failure; Pilot study; Revision of the pilot study (in case of necessary changes); Main experimental study; Data processing and analysis]

Figure 9-1 The flow diagram of the experimental process


9.2 Rationale for the experiment

The preceding Chapters presented a detailed overview of equipment failure

occurrences in the ATC environment from both technical and human perspectives. The

findings from past literature were augmented by operational failure reports (capturing

the technical aspect of equipment failures) and feedback from an international

questionnaire survey (capturing both technical and human aspects of equipment

failures). Furthermore, factors relevant to controller recovery were identified using both

theoretical and operational findings. These factors, referred to as Recovery Influencing

Factors (RIFs), created a basis for the quantitative assessment of the recovery context.

This Chapter builds on the preceding Chapters and generates ‘real’ operational data on

controller recovery. These data are further used in Chapter 10 to verify the quantitative

assessment of the recovery context developed in Chapter 8 and the relevance of RIFs

identified in Chapter 7.

9.3 Assessment of the available resources

An assessment of the requirements and necessary resources for the experiment

highlighted the need to perform it either at an ATC Centre or at an appropriately

equipped research institution. The critical requirements of the experimental design can be

grouped under two particular categories. These are the access to an ATC simulator

and the availability of licensed controllers. Based on these requirements several

potential locations were assessed:

• The Maastricht Upper Area Control Centre (MUAC) in the Netherlands. This is a

EUROCONTROL operational and simulation facility having the resources to provide

access to both simulators and controllers;

• Human Factors Lab at the EUROCONTROL Experimental Centre (France),

providing access to simulators but not controllers;

• The CEATS Research, Development and Simulation (CRDS) Centre in Budapest

(Hungary). This is a EUROCONTROL facility providing access to simulators but not

controllers; and

• Various Civil Aviation Authorities (CAAs), air navigation service providers

(ANSPs) and their respective ATC Centres providing access to both simulation

facilities and controllers.


Although the requirements for an experimental plan were ready at the initial stage of

the research, it took two years to gain access to the required facilities. After

considerable negotiations with all potential locations, only one CAA responded

positively and agreed to provide both simulation facilities and staff for this experiment.

Both the pilot and the main study were conducted using their facilities, assistance, and

manpower.

9.4 Planning for the experiment

The review of the relevant literature, presented in Chapter 5, revealed that there is a

lack of detailed knowledge of how controllers perform during unexpected or unusual

situations (including equipment failures). This is partly due to the fact that there is no

relevant data available in the public domain2. This necessitated the design of an

experiment in this thesis to capture and exploit the relevant data.

As a result of close academic cooperation, one European CAA gave Imperial College

London the opportunity to plan, prepare, and run an experiment designed to study the

factors that drive the process that controllers follow to recover from ATC equipment

failures. This experiment was conducted in two phases (see Table 9-1). The first phase

involved a pilot study designed to test the feasibility of the experimental plan including

the appropriateness of the recovery methodology, serviceability of the equipment, and

clarity of the instructions to the participating controllers working in the ATC Centre. The

results of the pilot study were used to enhance the plan for the main experiment. The

second phase of the study involved the execution of the main experiment where data

was collected for further analysis. A secondary objective was to assess and augment

the existing emergency training procedures as defined by this particular CAA in their

Manual of Air Traffic Services (MATS).

The planned experiments assumed a level of knowledge (on the part of the researcher)

necessary to fully comprehend the recovery process, in terms of the reactions and

actions of the controller in dealing with unexpected equipment failure. For this reason, it

was essential to acquire certain skills before running the actual experiments. To

achieve this objective, practical simulator training was completed by the researcher

prior to the execution of the main experiments (Table 9-1). The scheduled training was

preceded by a review of relevant ATC topics in order to prepare efficiently for practical

work on the simulator. The relevant areas covered were ATC phraseology, operational

procedures, equipment, radar vectoring, speed control, level busts, and aircraft

performance.

2 Some research was done in the UK National Air Traffic Services (NATS), but was not released for public use.

Table 9-1 Training, pilot study, and experiment sessions

Date | Phase | Objective | Comment
19-20 Feb 2005 | Planning for the experiment | Basic training for the ab initio student, APP training | Total of 10h training on simulator
26-27 Feb 2005 | Planning for the experiment | APP training (arrivals and departures sequencing, radar vectoring) | Total of 10h training on simulator
02 Nov 2005 | Phase I | Pilot study | Total of three controllers participated
29 Nov – 01 Dec 2005 | Phase II | Main study I | Total of eleven controllers participated
27 Feb – 02 Mar 2006 | Phase II | Main study II | Total of ten controllers participated
06 Jun – 09 Jun 2006 | Phase II | Main study III | Total of ten controllers participated

9.5 Design of the experiment

Since equipment failures are rare events3, the experiment aimed to represent failure in

the most realistic form, i.e. as an unexpected event. To assure the occurrence of failure as

an unexpected event, each controller participated once in the experiment. The

experiment also assumed a single-controller ACC sector (as opposed to a team of

controllers) to allow best utilisation of available ATC staff and to lessen any logistical

difficulties. Before the experiment, controllers were to be informed of the objectives of

the study in highly generic terms. They were to be given the opportunity to ask specific

questions in the post-experiment debriefing session. Additionally, to assure the

discretion and confidentiality of this study, each participant was to be required to sign a

consent form which incorporated an agreement not to disclose any information

regarding this experiment. In this way, the true objective of the experiment, i.e. the

injection of the unexpected and unforeseen equipment failure, was preserved.

3 Most of the failures in the ATC environment are prevented or handled at the

technical/engineering level. Only a few failures manage to penetrate multiple redundancies and fail-safe system design and affect controller performance.


The experiments were to be conducted during morning and afternoon sessions, with

participants tested in equal proportions across the two sessions. The

simulation room conditions (lighting, temperature, noise) were to be consistent for all

runs.
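The balancing of participants between the two daily sessions can be sketched as a simple randomised allocation. This is only an illustration under the assumption that allocation is random and that an even split is the sole constraint; the function name and the participant labels are hypothetical, and operational constraints such as shift patterns are ignored.

```python
import random


def assign_sessions(participants, seed=0):
    """Randomly split participants between the morning and afternoon
    sessions so that the two group sizes differ by at most one."""
    rng = random.Random(seed)       # fixed seed keeps the allocation reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = (len(shuffled) + 1) // 2  # morning takes the extra person if the count is odd
    return {
        person: ("morning" if index < half else "afternoon")
        for index, person in enumerate(shuffled)
    }


# Hypothetical participant labels for one experimental session.
schedule = assign_sessions([f"controller_{n:02d}" for n in range(1, 11)])
morning = sum(1 for session in schedule.values() if session == "morning")
afternoon = len(schedule) - morning
```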

Each simulation run was planned to last approximately 30 minutes, followed by a

debriefing session of similar duration. The instant of the injection of equipment failure

was planned to be precisely determined during the pilot study, occurring between the

5th and 15th minute of each run. The equipment failure would last 15 minutes. This was

decided based on two factors. Firstly, operational data shows that the majority of

failures last up to 15 minutes (Chapter 4 section 4.4.6). This has been confirmed by the

questionnaire survey results (presented in Appendix VI). Secondly, the 15 minute

duration of failure represents enough time to observe, capture, and assess the

controller reactions, performance, and overall recovery strategy.
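The planned timing of a simulation run can be summarised in a short sketch. It assumes a 30 minute run, a failure lasting 15 minutes, and an injection instant inside the planned window between the 5th and 15th minute. The default injection in the 10th minute is an assumed value within that window, and the class and phase names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTimeline:
    """Timing of one simulation run, in minutes from the start of the run."""
    run_length_min: int = 30
    injection_min: int = 10          # assumed value inside the planned 5 to 15 minute window
    failure_duration_min: int = 15

    def __post_init__(self):
        if not 5 <= self.injection_min <= 15:
            raise ValueError("injection must fall between the 5th and 15th minute")
        if self.restoration_min > self.run_length_min:
            raise ValueError("the failure must be restored before the run ends")

    @property
    def restoration_min(self) -> int:
        # Minute at which the failed equipment is restored.
        return self.injection_min + self.failure_duration_min

    def phase_at(self, minute: float) -> str:
        """Return the phase of the run at a given minute."""
        if not 0 <= minute <= self.run_length_min:
            raise ValueError("minute lies outside the simulation run")
        if minute < self.injection_min:
            return "pre-failure"
        if minute < self.restoration_min:
            return "failure"
        return "post-restoration"


timeline = RunTimeline()
```

With these defaults the equipment is restored in the 25th minute, which leaves the final five minutes of the run for the post-restoration period.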

The selection of the equipment failure to be simulated in the pilot study was based on

the results of the analysis of operational failure reports, the qualitative equipment

failure impact assessment tool, and the results of the questionnaire survey. However,

this selection was constrained by the technical capabilities of the available simulation

platform. In other words, it was important to simulate failure as well as the restoration of

the relevant equipment. Thus, the simulator platform would have to provide this

particular capability for a selected failure type. The final decision on the equipment

failure to be simulated would be achieved after testing candidate failure types during

the pilot study. The detailed rationale behind the selection of potential equipment

failures for the pilot and main experiment is given in the following section.

Another important factor of the experiment was the involvement of a Subject Matter

Expert (SME). The role of the SME would be to act as an observer and the coordinator

of the operations room. Upon a request from a controller, the SME would be

responsible for issuing any relevant information about the failure and its effect on the

ATC Centre (as would be required in the operational environment upon receiving an

update from the system control and monitoring unit). Upon restoration of the

equipment, there are several steps that controllers must perform to assure equipment

reliability and hence its readiness for the restoration of normal service (i.e. post-restoration

steps). Therefore, additional time would be given to controllers in the post-restoration

part of the simulation run, from the 25th to the 30th minute of each run, to restore a

normal working strategy after the effects of an unexpected equipment failure.

Each simulation run would be observed by the researcher and the SME, and recorded

for the purpose of further data analysis. During each simulation run, notes would be

taken on each controller’s recovery performance and changes in attitude/behaviour

prior to and after the injection of a failure. This would enable both qualitative and

quantitative data to be captured.

The observation team would be positioned in the most unobtrusive way, still having a

clear view of the radar screen. The simulation runs would be followed by an immediate

debriefing session guided by the questionnaire and other material designed specifically

for this session. The controllers would assess all the factors that potentially influenced

their recovery performance, guided by the RIFs identified in Chapter 7. In addition, they

would be given an opportunity to judge their own performance and the credibility of the

simulated failure.

9.6 Selection of the equipment failure to be simulated

The classification of ATC system functionalities, presented in Chapter 2, identified nine

main categories. The critical subsystems, equipment, and tools were identified in each

category. This categorisation identified the number of components that could fail within

the ATC system architecture. To further assess the characteristics of equipment failure

occurrence, Chapter 4 reviewed some of the main characteristics of failures in terms of

complexity, time course of failure development, overall exposure, and impact on ATC

and ATM operations.

Further assessment of equipment failure types is presented in Chapter 4 and is based

on the detailed analysis of operational failure reports from four different countries. This

analysis shows that equipment failures dominate within the communication, navigation,

surveillance, and data processing functionalities. A subsequent analysis of the level of

severity showed that most failures that have a major impact on ATC operations occur

within the communication, surveillance, and data processing functionalities.

Furthermore, the availability of the ‘duration’ variable in one of the datasets (Country

D), enabled identification of equipment failures lasting up to 15 minutes, which is the failure

duration feasible within this experimental set up. Failures with a major impact on ATC

operations lasting for a period of up to 15 minutes include: data exchange network,


other surveillance systems (predominantly radar link), the flight data processing

system, and air situational display (see Table 9-2).

Table 9-2 Overview of the potential equipment failures to be simulated and their inclusion in the pilot study

Source | Potential equipment failures to simulate | Qualitative equipment failure impact assessment tool rating | Adequacy for the pilot study | Comment | Testing in the pilot study
Operational failure reports (selection focused on major failures of short duration) | Data exchange network | Secondary functionality | No | It can range from moderate to minor and the selection tries to focus on major failures | -
Operational failure reports (selection focused on major failures of short duration) | Other surveillance systems (e.g. radar link) | Secondary functionality | No | | -
Operational failure reports (selection focused on major failures of short duration) | Flight data processing system | Primary functionality | Yes | - | Reduced flight plan mode
Operational failure reports (selection focused on major failures of short duration) | Air situational display | | Yes | Not interesting enough from the controller recovery perspective | -
Questionnaire survey | Air-ground communication | | Yes | - | Aircraft radio communication failure
Questionnaire survey | Primary surveillance radar | | Yes | Not possible to simulate failure of one radar, but only the complete loss of radar coverage | -
Questionnaire survey | Flight data processing system | | Yes | - | Reduced flight plan mode
Questionnaire survey | Communication panel | | No | […] controller recovery perspective as the controller would simply change the position | -
Questionnaire survey | Ground-ground communication | | No | […] controller recovery perspective as the controller would try to establish communication via other means | -

Furthermore, the analysis of the questionnaire survey responses in Chapter 6 (see Table

9-2) identified the five least reliable types of ATC equipment. These systems are:

air-ground communication, primary surveillance radar, flight data processing system,

communication panel, and ground-ground communication.


Having these nine possible failure types identified, it was necessary to select candidate

failure types for a final assessment in the pilot study in order to determine the failure to

be simulated in the main experiment. The rationale for this selection was based on the

severity of the failures as determined using the qualitative equipment failure impact

assessment tool (Chapter 4, section 4.5). The development of this tool was based

around the fact that not all equipment failures have the same severity of impact on ATC

operations. This tool identified the failures with the largest impact on ATC operations.

These are failures of the primary ATC functionality, which affect multiple

systems/tools/equipment either suddenly or gradually up to one hour in duration (see

Figure 4-9 and Table 9-2).

The process above, based on operational failure reports, the questionnaire survey, and

the qualitative equipment failure impact assessment tool, identified four potential failure

types. These are the failure of the flight data processing system, air situational display,

air-ground communication, and primary surveillance radar. These four candidate failure

types were further scoped by assessing their significance from the controller recovery

perspective but also their technical feasibility. In other words, the focus was on the

failures which require controllers to recover using only the systems available at their

positions. As a result, the pilot study simulated two different equipment failures. These

were a reduced flight plan mode as a part of the flight data processing system and

air-ground radio communication failure.
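The selection logic described above can be sketched as a simple filter over the rows of Table 9-2. The sketch encodes only the 'adequacy for the pilot study' and 'testing in the pilot study' columns as they appear in the table; the data structure and field names are hypothetical and are introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    source: str                 # where the failure type was identified
    failure: str                # candidate equipment failure
    adequate_for_pilot: bool    # 'Adequacy for the pilot study' column
    tested_as: Optional[str]    # 'Testing in the pilot study' column, None if '-'


TABLE_9_2 = [
    Candidate("operational failure reports", "Data exchange network", False, None),
    Candidate("operational failure reports", "Other surveillance systems", False, None),
    Candidate("operational failure reports", "Flight data processing system", True,
              "Reduced flight plan mode"),
    Candidate("operational failure reports", "Air situational display", True, None),
    Candidate("questionnaire survey", "Air-ground communication", True,
              "Aircraft radio communication failure"),
    Candidate("questionnaire survey", "Primary surveillance radar", True, None),
    Candidate("questionnaire survey", "Flight data processing system", True,
              "Reduced flight plan mode"),
    Candidate("questionnaire survey", "Communication panel", False, None),
    Candidate("questionnaire survey", "Ground-ground communication", False, None),
]

# Failure types judged adequate for the pilot study (duplicates removed, order kept).
shortlisted = list(dict.fromkeys(c.failure for c in TABLE_9_2 if c.adequate_for_pilot))

# Failures actually simulated in the pilot study.
simulated = list(dict.fromkeys(c.tested_as for c in TABLE_9_2 if c.tested_as))
```

The filter reproduces the four shortlisted failure types and the two failures simulated in the pilot study.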

Both failure types also conform to the requirements described in Chapter 5 (section

5.7.3) that the simulated equipment failure should allow one part of the diagnosis

phase of controller recovery to be performed overtly and thus be captured via

observations. For example, the flight data processing system failure may initially be

mistaken for an aircraft transponder or secondary surveillance radar failure.

Similarly, air-ground communication failure manifests itself in the same manner

regardless of its cause (i.e. ground- vs. airborne-based failure). In both cases, it is up to

the controller to identify the true failure by ruling out alternatives (e.g. communication

with pilot or adjacent ATC Centre) and this diagnostic process can be captured via

observations.


9.7 Pilot study: lessons learnt

Before conducting the main experiment, a pilot study was performed in order to

determine the feasibility of the experimental plan particularly with respect to the

serviceability of the equipment, ease of understanding of instructions, and logistical

issues. The study was designed to match the main experiment as far as possible.

Three controllers, selected at random and with no prior knowledge of the nature and

purpose of the experiment, participated in the study.

The pilot study was conducted on 2 November, 2005. It was part of a pre-planned

simulation, designed to test a newly restructured and reorganised airspace in the Area

Control Centre (ACC) of this particular ATC Centre. Of the three controllers who

participated in the pilot study, one was part of the airspace simulation test programme.

The others were volunteers who participated upon completion of their operational shift.

A total of three simulation runs were conducted. The first run was discarded due to the

inappropriate timing of the injection of the equipment failure.

The set up of the pilot study involved two Controller Working Positions (CWPs), with

the same simulation exercise running simultaneously on both CWPs. The participating

controller was located at one CWP, whilst the researcher and the SME occupied the

second CWP. In addition, a video camera was positioned in front of the second position

so that the controller would not be intimidated by its presence. The pilot study

simulated two equipment failures (Table 9-3) chosen based on the findings from

several sources (as discussed in section 9.6). There were no recovery procedures in

place for the first failure. The second failure has a recovery procedure defined by

international aviation organisations (see EUROCONTROL, 2003f; ICAO, 2001a) but

not implemented within the respective ATC Centre.

Table 9-3 Equipment failures used in the pilot study

Type of failure | Effect | Existence of recovery procedure | Human Machine Interface (HMI) indication on CWP
Reduced flight plan mode – failure of flight data processing system | Monitoring aid available only for flight plan tracks already displayed. Flight data functions not available | No | General Information Window/Flight Data Processing (FDP) label changes from white to yellow
Aircraft radio communication failure | Inability of the controller to contact aircraft on the dedicated frequency as well as emergency frequency. | No (not in the ATC Centre) | None


Several important conclusions were drawn from this pilot study and the lessons learnt

were used to enhance the main experimental design. These are as follows:

• Integration of a research experiment into any kind of on-going ATC training requires

significant collaboration with training instructors, the engineer in charge, and an

ATM specialist (SME). In spite of thorough preparation, the injection of failure in the

first simulator run did not occur at the required instant due to the unclear

instructions given to pseudo pilots. This issue was corrected in the subsequent

runs. Therefore, for the main experiment a complete understanding of the set up of

the experiment would have to be ensured between the training instructor, engineer

in charge, pseudo pilots, and the SME in order to avoid any misunderstanding. This

should involve detailed discussions prior to the first simulation run of the day.

• The initial intention was to inject an equipment failure in the 25th minute of the

simulation run, in order to give the controller adequate time to adjust to the traffic

scenario. However, the first run showed that this timing was inappropriate for two

reasons. Firstly, the controllers were all very experienced and thus did not require

the proposed length of time to adjust to the traffic scenarios. Secondly, the traffic

scenarios used had a low number of aircraft in the dedicated sector from the 25th

minute onwards. This was contrary to the plan to inject an equipment failure during

the periods of average to high traffic density. Both problems were corrected by

injecting a failure in the 10th minute of the simulation run and observing the

controller recovery process while traffic increased progressively during the 30

minute runs. Since the main experiment was to use fully licensed and experienced

controllers, the exact moment of failure injection would have to be based on the

number of aircraft in the sector. The aim would be to initiate failure with traffic levels

starting with average and then progressing towards high.

• The need for access to the simulator log files was identified for the purpose of

capturing all of the inputs of the controller on the keyboard and HMI. The main

purpose for these log files would be to extract the precise reaction time of the

controller following detection of the equipment failure. However, difficulties were

encountered in the acquisition and decoding of these log files. Log files from

simulation platforms tend to have a specific format and level of detail too

cumbersome to decipher. In addition, initial detection may not necessarily be

captured in these log files (as an actual action). This is because controllers may

detect the failures but not take any action until they have evaluated the impact of

the failure on the operation. Having considered all the advantages and

disadvantages of using log files, it was decided to omit them. An alternative was


developed based on the use of a camcorder with a precise timing capability

(synchronised with the CWP timer). In addition, a debriefing session with the SME

was implemented to validate the data captured throughout the recovery processes.

The moment of detection was further validated through the results of the interviews

with the participating controllers in the debriefing session.

• The debriefing session revealed that some changes to the questionnaire used in

the debriefing session would be necessary. This would involve amending several

questions to extract more information from the participating controllers (e.g. traffic

and airspace related questions were to be presented in such a way as to extract

more detailed information on precise characteristics such as mix of traffic, vertical

movements, crossing movements, sector design, size of the sector, and number of

entry and exit points).

• Due to staff shortage (i.e. ATM experts) and the significant duration of the

experiment (three sessions spread across 11 days), it was not possible to access

two SMEs to observe the performance of each controller.

• It was possible to define required recovery steps for the simulated equipment failure

types and thus avoid a level of variability in each simulation run (as a result of

differences in experience, working strategies, traffic complexity at the instant of

failure injection, and inconsistencies in the pseudo-pilot inputs). The required

recovery steps were validated by the SME.

• Several issues of a more technical nature were recognised: a need for the use of a

voice recording device in the debriefing stage of the experiment as a more efficient

means of capturing the controller responses, the need for two camcorders or a

combination of one camcorder and radar replay for the debriefing session, and the

need for the use of an 8mm tape camcorder instead of digital camcorders due to the

higher resolution achieved in recording and replay.

• Another factor of note was that the controllers tended initially to stop their work

when a failure occurred. This was because they felt this was a software

glitch or bug, common to real-time simulations. Therefore, the instructions

were to be updated to inform the controllers that in the case of any unusual event

they are expected to continue working as they would in the operational

environment. The experience of ATM specialists showed that although the

controllers may anticipate an unusual occurrence, this does not facilitate a better

handling of the occurrence (for evidence see Appendix II). Therefore, it was

assumed that prior warning of some unusual situation may not alter or enhance

controller recovery performance. It was more important that participating controllers


did not have advance knowledge of the nature of that unusual occurrence, i.e. ATC

equipment failure.

• Because of the great amount of data and observations to be collected, it was

realised that the main experiment would require an assistant. The primary task of

the assistant would be to observe and take notes/recordings of the controller’s overt

behaviour and attitude.

• Finally, although the simulation runs in the pilot study were designed to reflect high

traffic levels, failures were injected during a period of average to low traffic.

Additionally, no adverse weather was simulated, which would add to the complexity

of the exercise. As a result, the traffic scenario in the main experiment would

necessitate high traffic levels from the moment of failure injection throughout the

duration of the exercise. Additionally, adverse weather could be simulated resulting

in the unplanned rerouting of air traffic.

9.7.1 Summary of the findings from the pilot study

As a result of the findings from the pilot study and subsequent discussions with

technical staff and the SME, the following lessons were learnt and used to enhance the

main experimental study:

• A complete understanding of all details on the experimental set up has to be

ensured between the training instructor, engineer in charge, and the SME. In this

manner it is possible to provide a consistent injection of failure, adverse weather

conditions, and timely recordings for each simulation run of the main experiment.

This would require detailed discussions prior to the first simulation run of the day.

• In the main experiment the failure should be injected in the tenth minute of the

simulation runs, when the traffic reaches average levels and progresses towards

higher traffic levels.

• The main experimental set up would require an assistant to observe and take

notes/recordings of the controller’s overt behaviour and attitude.

• The main experimental set up should be based upon one traffic scenario with

average to busy traffic and adverse weather conditions (pseudo pilots should be

briefed to ask for rerouting due to adverse weather conditions); and

• The pilot study tested two different equipment failures. Both failure types showed

the potential for the experiment. However, the flight data processing system failure

was chosen for the main experiment as it is more demanding from the controller

recovery perspective. The failure would be injected as a sudden failure in the tenth

minute of each simulation run and it would last for 15 minutes.



The following section discusses the process adopted to set up the actual experiment

including a description of the characteristics of the simulated airspace, traffic, and

equipment failure type.

9.8 Experimental set up

The main experimental study was conducted in an ATC Centre (different from the one

used in the pilot study) in three separate sessions: from November 29 to December 1,

2005, from February 27 to March 02, 2006, and from June 06 to June 09, 2006 (Table

9-1). The reason for choosing a different ATC Centre from the one used for the pilot

study was to access a larger population of controllers and the required simulation facilities.

There were several differences in the set up of the main experimental study when

compared to the pilot study. The differences are presented in the following paragraphs.

Note that the other design specifications were maintained as given in section 9.5.

The population for this experiment should consist of the controllers from the ATC

Centre where the experiment was to be carried out. The population characteristics to

be sampled in this experiment are age, operational experience (i.e. years in service),

and rating of the controllers. Based on the statistical characteristics of human (i.e.

controller) performance and potential modelling with the normal distribution, the

minimal number of simulation runs (and thus participants) would be 20 (Shier, 2004).

However, collecting a larger sample of controller recovery performance poses a

significant challenge because of accessibility (to both controllers and a simulator

facility) and other logistical problems.

As a result, the study had a total of 31 simulation runs (eleven runs in the first session and ten runs in each of the second and third sessions) performed on the Beginning to End Skills Trainer (BEST) simulation platform. The main study was conducted in collaboration with various staff from the ATC Centre: one ATM specialist taking the role of the Subject Matter Expert⁴ (SME), technical staff supporting the simulation runs, several pseudo pilots, and a total of 31 controllers. All three sessions were designed to be as similar as possible in a given ATC environment.

4 The SME participating in this study is an ATM Specialist with 20 years of experience in many

facets of ATC and has 15 years of experience as an ATC instructor.



As mentioned previously, each simulation run was of approximately 30 minutes

duration, followed by a debriefing session of a similar duration. The experiment

(executed according to the timeline in Figure 9-2) used a pre-planned training exercise

modified for experimental use. After the first simulation run (which was discarded

afterwards), the exercise was amended to reproduce a busier traffic environment. In

other words, several arrivals were accelerated to achieve a busier period from the 10th

to the 25th minute of the exercise. FDPS failure was consistently injected in the 10th

minute of each run by pseudo pilots who manually de-correlated each new radar track.

In addition, pseudo pilots were instructed to simulate adverse weather conditions en

route by asking for necessary rerouting from the controller. Weather conditions were

scheduled for the fifth and fifteenth minute of the run. The FDPS was consistently

restored in the 25th minute of each run (see Figure 9-2).

Figure 9-2 Timeline of the experiment

The recovery process did not end with the restoration of the equipment (the 25th

minute) due to several steps that the controller had to perform to assure equipment

reliability and hence the readiness for the restoration of normal service. It usually took

one minute to accomplish these post-restoration steps. Additional time was given to

controllers in the post-restoration part of the simulation run (from the 25th to the 30th

minute of the run) to restore their normal working strategy and to calm down after the

effects of a highly stressful equipment failure occurrence.
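The schedule described above can be written down as a small data structure, which makes the failure duration and the post-restoration window explicit. The following is only an illustrative sketch: the minutes are those given in the text, while the `Event` structure, the labels, and the helper names are assumptions introduced here and were not part of the experimental apparatus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    minute: int  # minute of the run in which the event is scheduled
    label: str


# Minutes are taken from the text; the structure and labels are hypothetical.
TIMELINE = [
    Event(5, "first weather rerouting request"),
    Event(10, "FDPS failure injected"),
    Event(15, "second weather rerouting request"),
    Event(25, "FDPS restored"),
    Event(30, "end of simulation run"),
]


def minute_of(label: str) -> int:
    """Return the scheduled minute of the event with the given label."""
    for event in TIMELINE:
        if event.label == label:
            return event.minute
    raise KeyError(label)


def phase(minute: int) -> str:
    """Classify a minute of the run relative to failure injection and restoration."""
    if minute < minute_of("FDPS failure injected"):
        return "pre-failure"
    if minute < minute_of("FDPS restored"):
        return "failure"
    return "post-restoration"


failure_duration = minute_of("FDPS restored") - minute_of("FDPS failure injected")
post_restoration_window = minute_of("end of simulation run") - minute_of("FDPS restored")
print(failure_duration, post_restoration_window)  # 15 5
```

The sketch treats "the 10th minute" as minute 10 for simplicity; in the actual runs the moment of injection varied slightly because it was performed manually by the pseudo pilots.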

The SME involved in the study as an observer also acted as a coordinator to issue any

relevant information about the failure and its effect on the entire ATC Centre. This

notice was issued in response to queries from the participating controllers. However, if

a controller did not make any attempt to contact the coordinator, the SME issued this

information at the most suitable moment during the exercise (based on the level of the

controller’s workload).

Each simulation run was observed by the researcher, the assistant, and the SME, and recorded for the purpose of further data analysis. The assistant was mainly responsible for taking notes of the controllers’ overt behaviour prior to and after injection of failure. A check-list based on the SHAPE⁵ list of attitudes was used to guide the assistant in performing this task (EUROCONTROL, 2004f). The assistant was positioned as unobtrusively as possible, completely outside the controller’s field of view. On most occasions, the observation team was positioned as far from the controller’s field of view as possible, whilst still having a clear view of the radar screen. The precise set up of the simulation room in which the experiment took place and the positions of all parties involved are depicted in Figure 9-3.

Figure 9-3 Room set up

The simulation runs were followed by an immediate debriefing session guided by the

questionnaire and other material designed specifically for this session. The controllers

were asked to evaluate all the factors that potentially influenced their recovery

performance. In addition, they were given an opportunity to judge their own

performance and the realism of the exercise itself. The questionnaire and other

material designed for the experiment and the debriefing session are presented in

Appendix XIII.

Equipment failure in ATC, like any other unusual or emergency event, is highly stressful. In these instances the controllers are required to intervene with complex strategies and employ their knowledge under significant pressure and high psychological stress. For this reason, the debriefing session was used to help defuse stress by creating a relaxed interview environment where the participating controllers could evaluate their actions and performance. This session was structured in such a way as to enable comparisons across the participants. For this reason, a special debriefing sheet had been designed prior to the simulation runs. The rationale behind this structured approach to debriefing was to ensure a consistent and reliable acquisition of data on controller recovery performance. The debrief segment of the experiment was used to confirm and detail observations made during the simulation run via an approach similar to a “cognitive walkthrough”. In other words, this part of the experiment was used to discuss the sequence of recovery steps required by a controller to accomplish a recovery, and to validate failure detection and the factors that influenced each stage of the recovery (i.e. detection, diagnosis, and correction; further discussed in Chapter 10).

5 The SHAPE project is briefly explained in Chapter 7, section 7.3.1.3. The list of attitudes used to guide the assistant in the experimental process was derived from SHAPE attitude items, such as attentive, active, confident, thoughtful, calm, careful, and enquiring.

The following paragraphs give a brief description of the key elements of the

experiment in terms of airspace, traffic, and failure characteristics.

9.8.1 Airspace characteristics

The approach airspace of the ATC Centre where the experiment was carried out is

designated as class “C” airspace. This airspace extends horizontally over a radius of

30 NM from the airfield (runway 06/24, equipped with an instrument landing system (ILS) at

both runway ends). The vertical limits are from the surface to 8,000 ft or FL80.

However, in the case of an early handover from area control, the area of responsibility

of the approach control increases. For example, if an aircraft is handed over at FL180

descending to FL80, all of the airspace in between becomes the responsibility of the

particular approach sector. On a scale of one (adequate airspace) to three

(inappropriate airspace) the participating controllers ranked this airspace as 1.31 on

average, which translates to airspace of adequate to tolerable complexity (Table 9-4).

In addition, a series of in-depth questions on airspace characteristics were presented to

each controller to identify the specific features of this airspace. The most frequently

observed issues with traffic complexity were:

• that there were a variety of flight levels and altitudes utilised (from FL100 down to

FL90, 4500ft, 4000ft, 3500ft, 3000ft);

• that there were no specific entry and exit points (throughout the duration of this

experiment this particular airspace did not provide for any standard instrument

departure and arrival routes, i.e. SIDs and STARs); and

• that the complexity of the neighbouring sectors did influence complexity within the approach sector they operated in (e.g. two neighbouring sectors have a large amount of crossing traffic).
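The way a mean rating on the three-point scale is read as a verbal range (1.31 as "adequate to tolerable") can be illustrated with a short sketch. It assumes that the middle point of the airspace scale is labelled "tolerable", which is inferred from the wording above rather than stated explicitly, and the function name is hypothetical.

```python
import math


def describe_mean_rating(mean: float, labels: list[str]) -> str:
    """Map a mean on a 1..len(labels) scale to the scale labels that bracket it."""
    if not 1 <= mean <= len(labels):
        raise ValueError("mean rating lies outside the scale")
    lower = labels[math.floor(mean) - 1]
    upper = labels[math.ceil(mean) - 1]
    return lower if lower == upper else f"{lower} to {upper}"


# Scale points 1 and 3 are given in the text; the label for point 2 is an assumption.
AIRSPACE_LABELS = ["adequate", "tolerable", "inappropriate"]

print(describe_mean_rating(1.31, AIRSPACE_LABELS))  # adequate to tolerable
```

The sketch only names the two neighbouring scale points; it does not express how close the mean lies to either of them.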



Table 9-4 The mapping between exercise characteristics and the controllers’ observations

The exercise characteristics | The controllers’ observations
Airspace characteristics simulated as adequate | Adequate to tolerable
Weather conditions simulated as unchanged (pre- and post-failure) | Unchanged
Traffic characteristics simulated as high | Average to high

In addition, the weather conditions in the exercise simulated 15-25 knots southwest

wind, rain showers, half of the sky covered with cumulonimbus cloud (i.e. thunderstorm

cloud) with base at 1800ft, temperature of two degrees Celsius, and the pressure at

mean sea level (MSL) of 1032 hPa. Generally, in these conditions, icing will occur

inside cloud above 2000ft (in the ICAO standard atmosphere the temperature

decreases on average by 2 degrees Celsius/1000ft). Since the weather conditions pre-

and post-failure injection remained unchanged (i.e. re-routings requested by pilots in

both cases), the overall weather was marked as unchanged. This was confirmed by the

SME and participating controllers (Table 9-4).

9.8.2 Traffic characteristics

The exercise used in this experiment had a duration of 30 minutes and a total of 14

flights (one training aircraft, ten arrivals, and three departures), which translates to 28

aircraft per hour. In the peak segment of the training exercise, the controller was in

simultaneous radio contact with seven to eight aircraft. On a scale of one (high

complexity) to three (low complexity) the participating controllers ranked the traffic

complexity as 1.66 on average. This rating translates to average to high traffic

complexity (Table 9-4). In addition, a series of in-depth questions on traffic

characteristics were presented to each controller to identify the traffic characteristics

mostly observed in the given traffic scenario. These were:

• aircraft speed mix, i.e. differences in indicated airspeed (the speed read directly from the airspeed indicator on an aircraft), ranging from 125 knots to 250 knots;
• the utilisation of holds and the delays thus induced;
• only Instrument Flight Rules (IFR) aircraft utilising the airspace;
• high volume of traffic with vertical and crossing movements; and
• an average flight time in the sector of 10-15 minutes (longer than usual due to the injected equipment failure).
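The hourly traffic figure quoted at the start of this section follows from a simple scaling of the 30-minute exercise, sketched below. The flight counts are those given in the text; the function and variable names are illustrative assumptions.

```python
def hourly_rate(flights: int, duration_min: float) -> float:
    """Scale a flight count observed over duration_min minutes to aircraft per hour."""
    if duration_min <= 0:
        raise ValueError("duration must be positive")
    return flights * 60 / duration_min


# Composition of the exercise as described in the text.
FLIGHTS = {"training aircraft": 1, "arrivals": 10, "departures": 3}

total_flights = sum(FLIGHTS.values())
rate = hourly_rate(total_flights, 30)
print(total_flights, rate)  # 14 28.0
```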

9.8.3 Equipment failure characteristics

The choice of the equipment failure was driven by the previous analyses and four

different sources of information (operational failure reports, questionnaire survey, the



qualitative equipment failure impact assessment tool, and the pilot study). The FDPS

failure was chosen for this experimental set up for several reasons. Firstly, the data

available showed that this failure is both severe and frequent. Secondly, this failure

represents an example of major failures that affect multiple systems, as seen from the

qualitative equipment failure impact assessment tool. Thirdly, the participating CAA

does not have a written procedure for this particular failure, which makes the controller

recovery performance more dependent upon their knowledge, experience, and

personal abilities. Finally, the technical features of the Beginning to End Skills Trainer

(BEST) platform allowed injection of this failure type and its restoration in a fairly easy

way. In order to simulate equipment failure in the most realistic way, it was necessary

to have the ability to inject failure but also to restore system functionality rapidly. This

was possible with the FDPS failure and its degradation was simulated as a sudden

failure affecting the entire ATC Centre for a period of 15 minutes.

A visual representation of this type of equipment failure on the BEST platform is presented in Figure 9-4. A correlated radar track with all relevant flight-related information is presented on the left-hand side of Figure 9-4, whilst the uncorrelated track (resulting from the FDPS failure), depicting only the aircraft position, is on the right-hand side. It can be seen that the FDPS failure represented a failure which affects multiple systems. The actual effects of the FDPS failure are presented in Table 9-5 and in more detail in Table 9-6.

Figure 9-4 The visual representation of equipment failure on CWP: (a) before the failure, (b) after the failure

Table 9-5 Equipment failure in the experimental study

Type of failure | Effects | Existence of recovery procedure | HMI indication on BEST simulation platform
Reduced flight data processing mode | Monitoring aid only available with existing flight plans; flight data functions (flight plan management) not available; Safety Nets functions available; radar data functions available | No | None

[Radar track label fields: CALLSIGN, TYPE, AFL, XPT, GS, CFL, XFL, ADES]



Table 9-6 Availability of functions in the reduced flight data processing mode

Function | Availability
Radar data source |
Radar tracks | Available
Flight plan track | Only for flight plan tracks already displayed
Maps | Available
Tools | Available
Radar picture controls | Available
Flight plan facilities |
Flight plan commands | Not available
Flight plan lists | Partially available (for display only, frozen lists)
ATC messages de-queue management | Not available
Transmission of ATC messages | Not available
Coordination message | Not available
Alarm and warning facilities | Partially available (no MTCA warnings update)
General information area | Available
Mail box management | Not available
Operational data management | Partially available (runway in use and airspace management are not available)
Sectorisation | Partially available (only displayable)
Aeronautical Information System | Available
Load management facilities | Not available
Air Traffic Flow Management facilities | Not available
Operational load forecast facilities | Not available
Current Operational Load facilities | Not available
System survey facilities | Partially available (percentage of use of SSR code indication that a flight plan has received message is incorrect and alerts are not available)
Operational room configuration | Partially available (only displayable)
Manual printing facilities | Available
Operator roles (eligibility rules) | Partially available (only displayable)
Off-line customisation | Available
User mode of ATC position | Available
Repetitive flight plan database version management | Not available

9.9 Experimental variables

The following sections define the variables that were taken into account in the design of

the experiment to capture the characteristics of the recovery process in ATC. They are

defined as independent, dependent, and extraneous variables (see Table 9-7 and

Table 9-8) and discussed in the following sections.

Table 9-7 Overview of independent and dependent variables

Independent variable | Dependent variable
Set of 20 RIFs | The recovery context (recovery context indicator)
The required recovery steps | The recovery effectiveness
 | The recovery duration



9.9.1 Independent Variables

There are two sets of independent variables in this experiment. These are the

Recovery Influencing Factors (RIFs) and required recovery steps, discussed in the

following sections.

9.9.1.1 Recovery Influencing Factors (RIFs)

The research carried out in this thesis includes an assessment of the factors that

influence controllers during the process of recovery from equipment failures in ATC (i.e.

RIFs; see Chapter 7). A total of 20 relevant factors (RIFs) were identified. During the

post-experiment debriefing session each participating controller was presented with the

questionnaire. This questionnaire enabled controllers to mark and briefly explain the

influence of each RIF on their recovery performance as experienced in the simulation

run. Although it would be beneficial to question controllers on their experience with the

interactions between RIFs, this would considerably increase the complexity of the

experimental design. Therefore, a statistical approach was taken instead (presented in

Chapter 8).

Table 9-8 briefly summarises each of the 20 factors, specifying the key considerations

taken into account in the design of the experiment. Each factor is defined as either an

independent or an extraneous variable. Seven RIFs were kept constant for all participating

controllers (Table 9-8), whilst two RIFs were not considered in this experiment (i.e.

‘adequacy of alarm’ and ‘adequacy of alarm onset’).



Table 9-8 Overview of independent and extraneous variables

Variable | Independent variable / Extraneous variable | Comment
Training for recovery | √ | Assessed in the debriefing session.
Previous experience with equipment failures | √ | Assessed in the debriefing session.
Experience with system performance | |
Personal factors | √ | Assessed in the debriefing session.
Communication for recovery | √ | Existing studies from the nuclear industry have confirmed that communication within a team does have a significant impact on recovery performance (Kaarstad and Ludvigsen, 2002). Hence, the impact of this factor is fairly well known. Regardless, this variable will be assessed after the experiment.
Complexity of failure type | Constant (multiple systems affected) | Refers to single vs. multiple failure occurrences. The experimental set up should assess the impact of one failure which affects multiple ATC systems. Therefore this variable will be constant for all subjects.
Time course of failure development | Constant (sudden failure) | This variable varies between sudden failure and gradual degradation of the system. This variable will be constant for all subjects.
 | Constant (all workstations affected) | The experiment is conducted on a single workstation with one controller at a time, but the controller will be informed that the failure affects the entire ATC Centre.
Time necessary to recover | √ | This variable varies between adequate and inadequate time to recover. It can be influenced by several factors. Firstly, the characteristics of a given failure will drive the time necessary to recover through the criticality of the failed function and its detectability. Secondly, the controller characteristics will also have an effect. More experienced controllers may react and resolve an issue more quickly than less experienced ones. Finally, the characteristics of traffic at the moment of failure will drive the time necessary to recover. The more complex the traffic situation, the more recovery time will be needed by the controller. This variable will be assessed in the debriefing session.
Existence of recovery procedure | Constant (no procedure) | Theoretical review and various experiments in other safety-related industries have confirmed the relevance of procedures to recovery performance (Kaarstad and Ludvigsen, 2002; EUROCONTROL, 2004e; Kanse and van der Schaaf, 2000). Therefore, it was decided to choose a failure which does not have an appropriate recovery procedure.
Duration of failure | Constant (short duration, 15 min) | In the experimental set up, the duration of failure should be long enough to capture all phases of the recovery (e.g. 15 min), taking into account the total duration of the experiment.
Ambiguity of information | √ | Assessed in the debriefing session.
Adequacy of alarms/alerts | Not applicable for technical reasons | The experimental design aims to capture controller performance unaided by system tools, placing more emphasis on controller readiness to detect and react to an unexpected occurrence. Additionally, past research has already shown that in most cases the existence of an alert does have a significant impact on recovery performance (Kaarstad and Ludvigsen, 2002; Theis and Straeter, 2001).
Adequacy of alarm/alert onset | Not applicable for technical reasons | Existing studies from various industries have confirmed that the alert onset or its ‘cognitive convenience’ does have a significant impact on recovery performance (Straeter, 2005).
Adequacy of organisation | √ | Assessed in the debriefing session.
Traffic complexity | Constant (average to high) | This variable will be kept constant for all subjects. The aim is to reflect the current levels of traffic as well as the future predicted traffic increase. The declared sector capacity is defined as the number of aircraft entering the sector per hour, respecting the peak hour pattern, when controller workload is 70 percent in that hour (Majumdar and Ochieng, 2002). Therefore, the aim of the proposed experimental set up is to use a 30-min peak hour traffic sample that adequately reflects the sector’s declared capacity. In addition, the scenario should aim at a steady traffic increase up to the tenth minute into the scenario. The remaining 20 minutes of the scenario should reflect higher levels of traffic as well as controller workload.
Airspace characteristics | √ | This variable will be constant since each participant will experience the same airspace/sector characteristics. However, each controller will be able to assess the adequacy of airspace in the debriefing session.
Weather conditions | Constant | This variable will be constant for all participants. Poor weather conditions will be experienced in both the pre- and post-failure periods.
Conflicting issues in the situation | √ | Assessed in the debriefing session.
Age | √ | Assessed in the debriefing session.
Overall experience as a controller | √ | Assessed in the debriefing session.
Required recovery steps | √ | The set of required recovery strategy steps will be defined prior to the experiment based on the type of failure, traffic sample, and airspace characteristics.



9.9.1.2 Required recovery steps

The recovery performance of each participant was compared to the pre-determined set

of required recovery steps. These recovery steps were determined on the basis of

operational experience, since the participating Civil Aviation Authority (CAA) does not

have any official guidelines for this particular failure type (e.g. procedure, written

instruction). This set of required recovery steps was validated by the independent input

of the SME and two ATC instructors. It should be noted that controller performance

was highly dependent upon the traffic situation at the moment of failure and therefore

several different sequences of the recovery steps were possible. The seventeen

recovery steps listed in Table 9-9 represent one logical sequence. Whilst some

steps had to be performed only once (e.g. identification of

a failure type, informing the coordinator, and post restoration), others had to be

re-applied. For example, for each new (uncorrelated) track entering the dedicated

airspace, it was necessary to identify the traffic and maintain that identification. In

addition, timely and accurate strip marking was a must especially in the situation of

degraded equipment reliability, as simulated in this experiment. A detailed evaluation of

strip management and annotations should be addressed in future research.

An important point to note is that these simulation runs were not entirely identical in

spite of the great effort to achieve consistency amongst participants. The observed

differences were due to pseudo pilots’ manual actions, namely their incorporation of

requested weather rerouting and slight deviations of the moment of failure injection. In

short, pseudo pilots had to manually de-correlate each new track which influenced to

some extent the traffic distribution in each simulation run.

Due to the small differences in the simulation runs, further analysis focused only on the

list of required recovery steps (Table 9-9), irrespective of their sequence. The objective

was to capture these core steps (including the post-restoration steps, S14-S17) and

evaluate any deviations.
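Comparing a run against the required steps irrespective of their sequence amounts to a set comparison. The following minimal sketch uses the step labels S1 to S17 of Table 9-9; the function names and the observed run are hypothetical and serve only to illustrate the idea.

```python
REQUIRED_STEPS = {f"S{i}" for i in range(1, 18)}  # S1..S17 as labelled in Table 9-9
POST_RESTORATION_STEPS = {"S14", "S15", "S16", "S17"}


def step_number(step: str) -> int:
    """Numeric part of a step label, used only to sort the output."""
    return int(step[1:])


def deviations(performed):
    """Report required steps that were not performed, ignoring the order of execution."""
    missed = REQUIRED_STEPS - set(performed)
    return {
        "missed": sorted(missed, key=step_number),
        "missed_post_restoration": sorted(missed & POST_RESTORATION_STEPS, key=step_number),
    }


# Hypothetical run: steps performed out of sequence, with S9 and S15 omitted.
observed = ["S2", "S1", "S4", "S3", "S5", "S7", "S6", "S8",
            "S10", "S11", "S12", "S13", "S14", "S16", "S17"]
print(deviations(observed))
# {'missed': ['S9', 'S15'], 'missed_post_restoration': ['S15']}
```

A set-based comparison deliberately discards ordering and repetition, so steps that had to be re-applied for each new track are counted once.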

Table 9-9 Overview and description of required recovery steps

Required recovery step | Description
S1 | Detect the problem either by pilot’s contact or visually on the radar display (detection of the uncorrelated track). In both cases, the first assumption may be a transponder failure. After confirmation that the aircraft transponder is operational, a further check on ATC system performance should be conducted.
S2 | Locate traffic
S3 | Check identity of eastbound overflight
S4 | Identify all traffic using appropriate technique: bearing/range or turn method (turning the aircraft by 30 degrees or more)
S5 | Identify failure type (either by controller or by coordinator)
S6 | Inform all traffic on RTF of the failure and advise of possible restrictions
S7 | Maintain identification of all traffic
S8 | Ground the trainer
S9 | Refuse departing traffic permission to depart
S10 | All airborne traffic in inbound sequence should continue to be sequenced for landing (without unnecessary delay)
S11 | Maintain accurate and timely strip marking throughout the process
S12 | Provide vertical separation
S13 | Utilise holding patterns when necessary
S14 | After restoration has been confirmed by coordinator re-identify all traffic
S15 | Confirm Mode C
S16 | Continue to monitor
S17 | Release all departures (which leads to the restoration of the normal service)

It is important to state that some of the recovery steps above are of greater importance

to maintaining a safe ATC service than others. For example, maintaining identification

of all traffic, conducting timely and efficient strip marking and board management, and

maintaining separation are considered critical to overall safety in a degraded situation.

Other recovery steps, such as grounding the trainer and preventing departures, are of

less importance in that they are workload reduction measures. Nevertheless, their

implementation contributes to a safer traffic environment in unusual situations.

9.9.2 Dependent Variables

This study was designed to capture several quantitative and qualitative dependent

variables. The reason for this lies in the fact that controller recovery cannot be captured

through only one recovery variable as highlighted previously in Chapter 5. The

dependent variables in this experimental set up are recovery context (recovery context

indicator), recovery effectiveness and recovery duration (see Table 9-7). The precise

methodology for the assessment of the recovery context both as a qualitative and a

quantitative variable is presented in Chapter 8. The following sections investigate other

variables.

9.9.2.1 Recovery effectiveness

The recovery effectiveness of each participating controller was rated by combining three separate sources of data. Firstly, each participant’s recovery performance was rated during the simulation run. In general, this analysis was based on the performance indicators for a particular airspace, such as optimal use of airspace (separation of 5-8 NM), radar vectoring, speed control, use of radio telephony (RT), prioritisation of tasks, and appropriateness of traffic management. Secondly, the recovery effectiveness was rated based on the set of required recovery steps as explained in section 9.9.1.2. Thirdly, the steps identified earlier were grouped under three main tasks to enable credible rating (see Table 9-10). These are:

• System protection, i.e. recovery steps aimed at protecting the ATC system in case of further equipment deterioration. Note that the reduction of the controller’s workload through better traffic management is an integral part of system protection and as such is included in this task;
• Maintaining situational awareness (i.e. an accurate mental picture of traffic and airspace); and
• Post-restoration recovery steps.

Table 9-10 Recovery process and its three main tasks

System protection task | SA or mental picture task | Post-restoration task
Ground the trainer | Detect the problem | Re-identify all traffic
Refuse departures permission to depart | Identify failure type | Confirm Mode C
All airborne traffic in inbound sequence should continue to be sequenced for landing | Maintain accurate and timely strip marking | Continue to monitor
Utilise holding patterns when necessary | Identify all traffic (including eastbound overflight) | Release all departures
Inform all traffic and advise of possible restrictions | Locate traffic |
Provide vertical separation | Maintain identification of all traffic |

It should be noted that an assessment of controller performance is not a simple task of

counting the number of recovery steps performed versus the total number of required

steps. The reason for this lies in the different effects that each step has on the overall

recovery performance. Therefore, three sources of information enabled a structured

recovery assessment of each participant using the following five categories:

• Very good recovery performance (VG) - the controller employed a very good recovery strategy and all recovery steps;
• Good recovery performance (G) - the controller employed a good recovery strategy but failed to perform some of the steps;
• Adequate recovery performance (A) - the controller employed an adequate recovery strategy but failed to completely protect the ATC system in case of further equipment deterioration and failed to implement some of the post-restoration steps;
• Partially adequate recovery performance (PA) - the controller employed an inadequate recovery strategy. In other words, there was a complete lack of ATC system protection from possible further equipment degradation. In addition, the controller did not assure timely and accurate strip management and therefore had no means to support his/her situational awareness or mental picture of the traffic and airspace. The post-restoration steps were performed only to some basic extent without a proper check of the accuracy of new data; and
• Inadequate recovery performance (I) - the controller had no recovery strategy in place and no plan to reduce his/her own workload, and therefore failed to protect the ATC system in the case of further equipment deterioration. In addition, the controller failed to implement most of the post-restoration steps.

Although not attempted in this thesis, future research should assess the relevance and

contribution of existing tests such as the situational awareness test – SAGAT, to the

assessment of controller recovery.

9.9.2.2 Recovery duration

As previously discussed in Chapter 5, the recovery duration is measured as the time

from the first controller overt action to the end of the recovery process. The

measurement starts from the first controller overt action as opposed to the moment of

actual failure detection although they can differ significantly. Identifying the moment of

the failure detection can be an extremely difficult task as this first reaction usually

represents covert behaviour (i.e. detection) not directly observable. In the current

experimental set up and with the available apparatus, it was not possible to accurately

capture the moment of failure detection but only the controller’s first action as observed

on the ATC system.

More sophisticated equipment, such as an eye movement tracker (e.g. ASL Model

501), offers a better, but still not entirely accurate, approach to the discrimination of the

moment of failure detection. The reason for this is that there is no integrated measure

of eye point of gaze and brain activity which would differentiate between fixations with

information gathering and ‘stares’, when no information has been gathered6. Therefore,

even with the use of this advanced eye tracking equipment, it would not be possible to

firmly state the precise moment of failure detection. Whilst the moment of failure

detection was investigated during the post-experimental debriefing, it still proved to be

difficult to determine.

6 Personal correspondence with human factors experts from the Netherlands National Aerospace

Laboratory (NLR) and EUROCONTROL Experimental Centre (Human Factors Lab).

For this reason, the research presented in this thesis uses the controller’s first action to

measure the recovery duration. It is necessary to highlight that this first observable

action may be postponed for two generic reasons. Firstly, the controller may not

necessarily detect the uncorrelated track as soon as it becomes visible on the radar

display. Secondly, the controller may detect it immediately (upon its presentation on the

radar display) but consciously delay any action due to the workload experienced or the

presence of a more urgent task which needs to be addressed first. For example, the

controller may need to address tasks that are completely unrelated to the

recovery process, such as turning an aircraft to intercept the ILS localiser for the

approach and landing, or radar vectoring of traffic with a speed differential. In other

words, the controller’s first action is the moment when the controller decides to initiate

an appropriate recovery strategy and not necessarily the actual time when he/she

detects the uncorrelated label. It is well known that controllers develop their own

working strategies concurrently with gaining experience and proficiency with years on

the job. This results in the gradual build-up of ‘personal criteria’ for separation limits and

methods for solving potential conflicts (whether by changing the speed of the aircraft,

its flight level, or heading).

Based on the moment of the controller’s first action, the recovery duration was

determined by observation of simulation runs and recorded video/audio material. It

should be noted that controller recovery performance did not stop with the restoration

of FDPS service, but continued to include all necessary post-restoration steps. The

post-restoration steps are required to restore normal service and to confirm that the

restored functionality provides accurate information. Discussion with the SME indicated

that this stage of the recovery should take up to one minute. This limit was set to bound

the recovery duration for controllers who fail to perform all post-restoration steps.

As a result, the recovery duration was directly influenced by the duration of the failure

(15 minutes) and the period required for the post-restoration phase (one minute). Thus,

the recovery duration could reach a maximum of 16 minutes only if the controller

immediately initiates recovery action(s). The more time it takes for the controller to

initiate recovery action, the shorter the recovery duration will be.
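The dependence of the maximum measurable recovery duration on the delay of the first overt action can be illustrated with a minimal sketch. The function name and the idea of expressing the delay in minutes from failure onset are assumptions made for illustration; only the 15-minute failure and the one-minute post-restoration limit come from the text.

```python
FAILURE_DURATION_MIN = 15.0   # duration of the simulated failure (from the text)
POST_RESTORATION_MIN = 1.0    # limit on the post-restoration phase (from the text)

def max_recovery_duration(first_action_delay_min: float) -> float:
    """Upper bound on the recovery duration, in minutes.

    first_action_delay_min is the (hypothetical) time between failure onset
    and the controller's first overt action.
    """
    if not 0.0 <= first_action_delay_min <= FAILURE_DURATION_MIN:
        raise ValueError("delay must lie within the failure period")
    return FAILURE_DURATION_MIN + POST_RESTORATION_MIN - first_action_delay_min

print(max_recovery_duration(0.0))   # immediate first action
print(max_recovery_duration(2.5))   # first action delayed by 2.5 minutes
```

The bound decreases linearly with the delay, which is why a late first action shortens the measured recovery duration without saying anything about recovery quality.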

The results of all three sources of information as well as the final rating for each

participant were confirmed by the one SME involved in the experiment. Clearly, having


the participation of more SMEs would increase the validity of the outcome of the

experiment. Future research should address how statistical representation could be

achieved given the logistical difficulties associated with these types of experiments.

9.9.3 Extraneous Variables

Extraneous variables influence the outcome of an experiment, although they are not

the variables of interest. These variables are undesirable because they add errors to

the experiment. A major goal in the experimental design is to eliminate the influence of

extraneous variables as much as possible. If it is not possible to eliminate them, they

should be controlled. Two extraneous variables in this experiment could not be

controlled. These are:

• Operational experience (i.e. years in service)

The differences in the level of experience were to be captured once the controllers were recruited for the experiment. The experience variable is divided into the following categories: 1-10; 11-20; 21-30; and 31-40 years.

• Personal factors

There is a wide variety of factors that could be categorised as personal. Some of these are more complex to determine than others. For example, factors like health, vision, level of confidence, complacency, level of trust in automation, self esteem (i.e. trust in own ability), personality type, motivation, attitudes deriving from family or close social group, etc. require specific sets of tests which can be too complex and too time consuming. However, age was to be captured once the controllers were recruited for the experiment. Fatigue and stress were to be controlled by using rested controllers, as were ‘time of the day’ (i.e. relevance of circadian rhythm) and time into the shift (i.e. level of situational awareness as well as fatigue). In short, the experiment was to be conducted in the same periods of the day, where half of the subjects were to be tested in the morning sessions, and the other half in the afternoon sessions.

9.10 Potential limitations

There are two limitations of the experimental set up and its use to capture data. Firstly, one limitation is the individual differences of the participants (i.e. controllers). These are characteristics that differ from one participant to another, which could be overcome by using random assignment or matched groups (to ensure that different groups are equivalent with respect to pre-selected characteristics, e.g. experience and age). Secondly, validation of the recovery performance of each participating controller by only one SME creates a potential for bias. Although special attention was given to the choice of the SME (in terms of experience and expertise), only one SME was available for this experiment.

9.11 Summary

This Chapter has presented in detail the experiment designed to capture controller

recovery in ATC. The Chapter started by justifying the need for the field experiment.

This was followed by an assessment of the available resources and the key

requirements that had to be accomplished. The Chapter continued by discussing and

justifying the overall experimental set up and data acquisition. This included the

presentation of the rationale for the choice of the equipment failures to be tested in the

pilot study. After the lessons learnt from the pilot study, it was possible to implement

the final changes and fine tune the set up of the main experiment. This segment

focused on the characteristics of the simulated traffic, airspace, and equipment failure,

as well as on the research variables while highlighting potential limitations. The

following Chapter analyses the data captured from this experiment.

10 Analysis of Experimental Results

The previous Chapters identified a set of relevant contextual factors or Recovery

Influencing Factors (RIFs) and developed a novel approach for the quantitative

assessment of the recovery context. This approach and its operational benefits are

further verified in this Chapter by an experimental investigation conducted in a training

facility of an Air Traffic Control (ATC) Centre with the participation of 30 operational air

traffic controllers. In addition to the assessment of the recovery context, the

experimental data are used to assess controller recovery performance using the

recovery variables identified in Chapter 5.

The Chapter starts with the overall framework for the analysis of a unique set of data

on controller recovery performance. This is followed by the analysis of the

characteristics of the sample of controllers participating in the experiment. The Chapter

continues with an assessment of controller recovery performance using three recovery

variables, namely recovery context, duration, and effectiveness. It concludes by

focusing on the outcome of the recovery process, as captured in the experiment.

10.1 Overall framework

The objective of the experiment conducted in this research is mainly to capture data

related specifically to controller recovery from equipment failure in ATC. Based on the

experimental set up (presented in Chapter 9), three experimental sessions were

conducted with 30 controllers from a particular ATC Centre who participated on a

voluntary basis. The controllers were asked to complete one emergency training

session (based on a simulated Flight Data Processing System, FDPS, failure), followed

by a debriefing session.

The framework for the analysis of data collected on controller recovery from an FDPS

failure is structured according to Figure 10-1. It starts by assessing the characteristics

of the controllers who participated in the experiment. This is followed by a detailed


analysis of the recovery variables defined in Chapter 5, their interactions, and other

relevant findings obtained from the experiment.

[Figure 10-1: flow diagram. Box labels: Experimental results (30 operational air traffic controllers; one particular ATC Centre; simulated Flight Data Processing System (FDPS) failure); Participants (age, operational experience, ratings); Analyses of recovery variables (recovery context, recovery context indicator, recovery effectiveness, required recovery steps, the recovery phases, observed behaviour and attitude, recovery duration); Analysis of interactions; Outcome of the recovery process; Other findings.]

Figure 10-1 Framework for the analysis of experimental results

10.2 Participants

As discussed in section 9.8 (Chapter 9), it is important that statistical representation is

achieved in research that involves sampling of the population. In this case, such

representation is required for the ATC Centre where the experiment was to be carried

out. The main distinguishing characteristics of the controllers are age, operational

experience (i.e. years in service), and rating. This section analyses these and makes a

link to statistical representation.


10.2.1 Age and operational experience

The average age of the controllers who participated in the experiment is 37 years,

ranging from 24 to 58 years. On average, they have more than 12 years of operational

experience, ranging from 2 to 35 years. Figure 10-2 shows the distribution of

operational experience of sampled controllers in terms of the four categories adopted

for the questionnaire survey in Chapter 6. It can be seen that the sample is reasonably

representative of the population of controllers in the particular ATC Centre as all

experience categories have been represented. The under-representation of controllers

with over 30 years of experience is to be expected as the majority of the controllers in

this category tend to move to operational support roles (e.g. ATC instructors). This

finding is in line with the results of the questionnaire survey (Chapter 6) where there

were fewer respondents with over 30 years of experience.

Figure 10-2 Distribution of operational experience
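The grouping of controllers into the four experience categories can be sketched as follows. The function name and the sample values are hypothetical and serve only to show the binning; they are not the experimental data.

```python
from collections import Counter

# Category bounds in years of service, as used in the questionnaire survey
CATEGORIES = ((1, 10), (11, 20), (21, 30), (31, 40))

def experience_category(years: int) -> str:
    """Return the label of the experience category containing `years`."""
    for low, high in CATEGORIES:
        if low <= years <= high:
            return f"{low}-{high}"
    raise ValueError(f"{years} years falls outside the defined categories")

# Hypothetical sample, for illustration only
sample_years = [2, 7, 12, 15, 24, 35]
counts = Counter(experience_category(y) for y in sample_years)
for label, n in sorted(counts.items(), key=lambda item: int(item[0].split("-")[0])):
    print(label, n)
```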

10.2.2 Ratings

Figure 10-3 presents the distribution of the ratings of the controllers who participated in

the experiment. Considering that the training exercise was designed for the approach

control course (APP), it is important to highlight that 20 percent of the participants did

not have APP rating. However, half of these participants had ACC rating which

incorporates training in elements of approach control (as a part of the low level ACC

course). Although the remaining participants had only TWR rating, they had just


completed an APP course and therefore possessed knowledge of all relevant elements

of approach control.

[Figure 10-3: bar chart; x-axis: Ratings; y-axis: Percent (0 to 40). Bar labels and values in the order shown: All (ACC, APP, TWR) 36.7; ACC and APP 26.7; ACC and TWR 3.3; APP and TWR 10; ACC 6.7; APP 6.7; TWR 10.]

Figure 10-3 Distribution of controllers’ ratings

Since the experiment was conducted in three separate sessions (as discussed in

section 10.1), it is important to investigate whether the sampling on all three occasions

was appropriate. In other words, it is important to show that all three sessions come

from the same population of controllers from the ATC Centre, and that aggregated,

they represent a proper sample (Table 10-1).

Table 10-1 Characteristics of a sample of controllers participating in experiment

Variables                                Experimental session 1   Experimental session 2   Experimental session 3
Age (mean, standard deviation)           M=35.9, SD=8.95          M=37.9, SD=10.3          M=37.7, SD=9.73
Experience (mean, standard deviation)    M=10.7, SD=6.70          M=14.3, SD=11.08         M=13.7, SD=8.22
Category of experience (frequency)
  1-10                                   5                        5                        4
  11-20                                  4                        2                        5
  21-30                                  1                        2                        0
  31-40                                  0                        1                        1
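As a consistency check, the session-level means in Table 10-1 can be combined into pooled means and compared with the overall figures quoted in section 10.2.1 (an average age of 37 years and more than 12 years of experience). The sketch below assumes ten controllers per session, which follows from summing the category frequencies in each column; the function name is illustrative.

```python
# Frequencies per experience category (rows) and session (columns), from Table 10-1
category_counts = {
    "1-10":  [5, 5, 4],
    "11-20": [4, 2, 5],
    "21-30": [1, 2, 0],
    "31-40": [0, 1, 1],
}
# Session sizes obtained by summing each column
session_sizes = [sum(col) for col in zip(*category_counts.values())]

age_means = [35.9, 37.9, 37.7]          # session means, from Table 10-1
experience_means = [10.7, 14.3, 13.7]   # session means, from Table 10-1

def pooled_mean(means, sizes):
    """Size-weighted mean of group means."""
    return sum(m * n for m, n in zip(means, sizes)) / sum(sizes)

print(session_sizes)
print(round(pooled_mean(age_means, session_sizes), 2))
print(round(pooled_mean(experience_means, session_sizes), 2))
```

The pooled values come out at roughly 37.2 years of age and 12.9 years of experience, in line with the figures reported in the text.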

The Mann-Whitney non-parametric test was used to investigate differences in the age and operational experience of controllers between the three experimental sessions. Details of this statistical test are presented in Chapter 6, section 6.7.4. The statistical tests1 at the 95 percent confidence level indicated that there is no statistically significant difference between the three experimental sessions (p>0.05). Based upon this, data were pooled for further analyses.
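The pairwise comparison of sessions can be sketched with a small, self-contained implementation of the Mann-Whitney U test. This is a minimal sketch, not the procedure used in the thesis: it relies on the normal approximation with a tie correction and no continuity correction, which is only approximate for groups of ten, and the ages below are hypothetical placeholders rather than the experimental data.

```python
from itertools import combinations
from math import erfc, sqrt

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U, z, p), where U is the smaller of the two U statistics.
    """
    pooled = sorted((value, group) for group, sample in enumerate((x, y))
                    for value in sample)
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:                      # assign midranks to runs of tied values
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2     # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:                    # all observations identical
        return u, 0.0, 1.0
    z = (u - n1 * n2 / 2) / sqrt(var_u)
    p = erfc(abs(z) / sqrt(2))        # two-sided p-value
    return u, z, p

# Hypothetical ages for three sessions of ten controllers each
sessions = {
    1: [24, 28, 30, 33, 35, 36, 38, 41, 45, 49],
    2: [25, 27, 31, 34, 37, 39, 42, 44, 48, 58],
    3: [26, 29, 32, 34, 36, 38, 40, 43, 47, 52],
}
for a, b in combinations(sessions, 2):
    u, z, p = mann_whitney_u(sessions[a], sessions[b])
    print(f"sessions {a} and {b}: U={u:.1f}, z={z:.2f}, p={p:.3f}")
```

Each pair of sessions is tested separately, mirroring the three pairwise null hypotheses described in the footnote to this paragraph.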

10.3 Assessment of controller recovery performance

The main objective of the research presented in this thesis is to investigate controller

recovery from equipment failures in ATC. The discussions in Chapter 5 concluded that

the assessment of controller recovery needs to cover the recovery context,

effectiveness, and duration, followed by the assessment of the outcome of the recovery

process. The section continues with an analysis of the interactions between recovery

variables and concludes with the discussion of other relevant experimental findings.

10.3.1 Recovery context

The thesis used a set of RIFs, identified in Chapter 7, to develop a novel approach for

the quantitative assessment of the recovery context through the concept of a recovery

context indicator (presented in Chapter 8). The experiment carried out and presented in

Chapter 9 attempts to verify this approach and its operational benefits. The following

sections adapt the proposed methodology to the particular environment of the ATC

Centre used as a case study. This is achieved in several steps. Firstly, it is necessary

to assess all candidate RIFs and identify those relevant to a particular ATC Centre.

Secondly, the probabilities for each RIF (and its corresponding levels) are defined

based on the controllers’ input during the debriefing sessions. Thirdly, RIF interactions

are assessed and incorporated. Finally, the recovery context indicator is calculated as

a numerical representation of the context surrounding the simulated FDPS failure and

the subsequent controller recovery. These steps are presented in detail in the following

paragraphs.

10.3.1.1 Assessment of relevant RIFs

This step consists of the assessment of the 20 candidate RIFs and their relevance to

the experiment and the particular ATC Centre involved. Of these RIFs, ‘adequacy of

alarm’ and ‘adequacy of alarm onset’ are not relevant since there was no alarm/alert in

the design of the experiment (see Table 9-7, Chapter 9). There are two reasons for

this. Firstly, the experiment in this research is designed to capture controller recovery

unaided by system tools, and emphasis is placed on controller readiness to detect and

react to an unexpected failure. Secondly, past research has already shown that in

most cases the existence of an alert does have a significant impact on recovery

performance (Kaarstad and Ludvigsen, 2002; Theis and Straeter, 2001). As a result, 18

RIFs were determined to be relevant to this experiment.

1 Statistical tests investigated the null hypothesis for experimental sessions 1 and 2, 1 and 3,

and 2 and 3, separately.

10.3.1.2 Probabilities of each RIF and the corresponding levels

Based on data collected during the post-experiment debriefing session it was possible

to derive probabilities of each RIF and its corresponding levels. The results for all 18

RIFs are presented in Appendix XIV. Furthermore, these probabilities are used to verify

the RIF probabilities defined in Chapter 8 using the verification criteria (Table 10-2). In

other words, a set of expectations was defined before comparing the RIFs probabilities

derived for a ‘generic’ ATC Centre (Chapter 8) and a particular ATC Centre (used in

the experiment).

Table 10-2 Verification of RIFs probabilities from a ‘generic’ approach (Chapter 8) and the experiment

RIF group: Internal
  Verification criteria: No difference
  Result: No difference, except ‘Communication for recovery’
  Comment: The controllers who participated in the experiment rated their communication mostly as ‘tolerable’, compared to the ATM specialists who rated it mostly as ‘efficient’. The experience with an equipment failure in the simulated environment may have indicated some shortcomings in the communication for recovery to participating controllers, of which ATM specialists were not aware.

RIF group: Equipment-related
  Verification criteria: No difference
  Result: No difference
  Comment: Note that five out of six RIFs in this group have been controlled in the experimental design.

RIF group: External
  Verification criteria: Potential for difference
  Result: No difference, except ‘Adequacy of organisation’
  Comment: The controllers who participated in the experiment rated the organisation in their ATC Centre mostly ‘tolerable’ while the overall rating from ATM specialists was mostly ‘efficient’. This is a result of the local ATC Centre characteristics masked within more generic characteristics captured by eight ATM specialists.

RIF group: Airspace-related
  Verification criteria: Potential for difference
  Result: Difference is observed with ‘traffic complexity’ and ‘overall task complexity’
  Comment: This is expected as the experimental design planned for high traffic levels and overall task complexity (resulting from the simulated equipment failure).

The expected differences in RIF probabilities are a result of the experimental design

(e.g. traffic complexity and task complexity) and the overall difference in the


populations sampled (i.e. various ATC Centres sampled in Chapter 8 compared to the

ATC Centre sampled in the experiment). In short, the comparison of RIFs probabilities

for a ‘generic’ and a particular ATC Centre shows similarity.

10.3.1.3 Interactions between RIFs

This step consisted of an assessment and subsequent incorporation of interactions

between identified RIFs, as presented in Table 8-5 (Chapter 8). Based on the

methodology for the quantification of RIFs interactions developed in section 8.4.3 of

Chapter 8, it is possible to determine the coefficient of interaction for the interactions

between 18 relevant RIFs. This coefficient is k=1/(N-1)=1/17=0.059 (where N

represents the total number of relevant RIFs).

10.3.1.4 Recovery context indicator (Ic)

This particular study investigated 18 relevant RIFs, where six RIFs are defined via

three levels of impact and six RIFs via two levels of impact (according to qualitative

descriptors defined in Chapter 7, section 7.3). The remaining six RIFs are defined

through only one level, either because factors were controlled in the experiment or the

participants gave identical answers. For details see Table 10-3 and Chapter 9. In total,

this approach generates 3^6 × 2^6 = 46,656 possible contexts, each defined through the

corresponding recovery context indicator.
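The number of possible contexts follows directly from the number of levels per RIF and can be reproduced with a short enumeration. The variable names are illustrative; the level counts (six RIFs with three levels, six with two, six with one) are taken from the paragraph above.

```python
from itertools import product
from math import prod

# Number of levels for each of the 18 relevant RIFs
levels_per_rif = [3] * 6 + [2] * 6 + [1] * 6

# Closed-form count: product of the number of levels of every RIF
n_contexts = prod(levels_per_rif)

# Explicit enumeration: each context assigns one level (1..n) to every RIF
contexts = product(*(range(1, n + 1) for n in levels_per_rif))
enumerated = sum(1 for _ in contexts)

print(len(levels_per_rif), n_contexts, enumerated)
```

RIFs defined through a single level contribute a factor of one, so they fix part of every context without increasing the count.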


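Table 10-3 Summary of RIFs defined through a single corresponding level

Recovery Influencing Factor (RIF): Complexity of failure type
  Descriptor: Multiple systems affected
  Probability: 1
  Level: 3
  Comment: Simulated Flight Data Processing System (FDPS) failure affects multiple systems

Recovery Influencing Factor (RIF): Time course of failure development
  Descriptor: Sudden failure
  Probability: 1
  Level: 1
  Comment: The FDPS failure is simulated as a sudden failure

Recovery Influencing Factor (RIF): Number of workstations affected
  Descriptor: All workstations
  Probability: 1
  Level: 3
  Comment: The FDPS failure is simulated to affect the entire ATC Centre

Recovery Influencing Factor (RIF): Existence of recovery procedure
  Descriptor: Inappropriate
  Probability: 1
  Level: 3
  Comment: The objective of the experimental investigation was to simulate failure without recovery procedure

Recovery Influencing Factor (RIF): Duration of failure
  Descriptor: Short period of time
  Probability: 1
  Level: 2
  Comment: The FDPS failure is simulated to last long enough to capture all phases of the recovery

Recovery Influencing Factor (RIF): [RIF name missing]
  Descriptor: External working environment matches the controller’s internal mental model
  Probability: 1
  Level: 1
  Comment: The controllers responded positively to the question on match between external environment and internal mental model, although they could not say that this match was one hundred percent.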

After the calculation of all 46,656 possible contexts it was determined that the mean

value of the Ic is 0.029, ranging from -0.088 to 0.121. The distribution of the recovery

contexts is presented in Figure 10-4. Based on the shape of the Ic distribution, the data

has been fitted with two normal distributions. The result of this fitting is presented in

Appendix XV.

[Figure 10-4: histogram; x-axis: recovery context indicator (Ic), tick labels from -0.088 to 0.112; y-axis: Frequency (0 to 800).]

Figure 10-4 Distribution of the recovery context indicator in the experiment


Using the experimental results, the distribution of the Ic derived in Chapter 8 is

assessed using the verification criteria (Table 10-4). In other words, a set of

expectations was defined before comparing the distribution of Ic for a ‘generic’ ATC

Centre (Chapter 8) and a particular ATC Centre used in the experiment.

Table 10-4 Verification of the distribution of the recovery context indicator obtained from a ‘generic’ approach (Chapter 8) and the experiment

Verification criteria (all characteristics of Ic): Potential for difference as a result of the local characteristics of a particular ATC Centre as compared to a ‘generic’ ATC Centre

Recovery context indicator (Ic)    Result
Shape                              The difference is observed with the left tail of the distribution
Mean                               Similar (see footnote 2)
Median                             Similar (see footnote 3)
Range                              Similar (see footnote 4)

The main difference observed is the shape of the distribution in the left tail. This cannot

be explained by the difference in the RIF probabilities as the previous section showed

that they differed for only two RIFs, as a result of the characteristics of the experimental

design. Therefore, it is assumed that the shape of the left tail resulted from the local

characteristics of the ATC Centre used in the experiment (Figure 10-4). Although these

characteristics may have existed in the distribution of Ic obtained from a ‘generic’ ATC

Centre (Chapter 8), they may be masked by a ‘generic’ approach.

Therefore, the cause of the deviation in the left tail may be the incorporation of a single

coefficient of interaction between all RIFs, as discussed in section 8.4.3 of Chapter 8.

Although it is known from operational experience that the RIF interactions do not

have the same level of influence, this thesis had to define a more generic approach to

account for the lack of operational data.

The assumption that the change in the shape of the Ic distribution (in the left tail) results from a single value of the coefficient of interaction, which cannot properly account for local characteristics, is further assessed using the example of the RIF ‘Adequacy of HMI and operational support’. This RIF is chosen because the interaction matrix (Table 8-26, Chapter 8) indicates that it impacts on several other RIFs. Thus, a change in its coefficient of interaction may have a significant impact on the Ic distribution. As a result, the coefficient of interaction relevant to this RIF is increased from the previous value of k=1/(N-1)=1/17=0.059 (section 10.3.1.3) by a factor of 10 to the new value of k=10/(N-1)=10/17=0.59. The resulting distribution of Ic, presented in Figure 10-5, shows a notable change in the shape of the left tail.

2 A mean value of Ic for a ‘generic’ ATC Centre is 0.027, whilst for the ATC Centre used in the experiment it is 0.029.

3 A median value of Ic for a ‘generic’ ATC Centre is -0.023, whilst for the ATC Centre used in the experiment it is -0.026.

4 A range of Ic values for a ‘generic’ ATC Centre is from -0.069 to 0.131, whilst for the ATC Centre used in the experiment it is from -0.088 to 0.121.
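The two values of the coefficient of interaction quoted above can be reproduced with a few lines of code. The function and its `factor` parameter are illustrative names; the formula k = factor/(N-1) with N = 18 relevant RIFs is taken from the text.

```python
def interaction_coefficient(n_rifs: int, factor: float = 1.0) -> float:
    """Coefficient of interaction k = factor / (N - 1) for N relevant RIFs."""
    if n_rifs < 2:
        raise ValueError("at least two RIFs are needed for an interaction")
    return factor / (n_rifs - 1)

k_baseline = interaction_coefficient(18)               # 1/17
k_increased = interaction_coefficient(18, factor=10)   # 10/17

print(round(k_baseline, 3), round(k_increased, 2))
```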

[Figure 10-5: histogram; x-axis: recovery context indicator (Ic), tick labels from -0.088 to 0.092; y-axis: Frequency (0 to 800).]

Figure 10-5 Distribution of the recovery context indicator in the experiment with an increased value of the coefficient of interaction

In short, the comparison of the distribution of Ic obtained from a ‘generic’ ATC Centre

and from the particular ATC Centre shows no difference in the mean, median, and

range, but only in the shape of the left tail. This difference in the shape has been

explained by the inadequate definition of the coefficient of interaction. As previously

discussed in Chapter 8, a more accurate definition of this coefficient will be possible once

a detailed database of human performance becomes available in the ATM industry.

While the controllers’ responses gave a basis for the definition of the recovery context

indicator (Ic) through each possible recovery context, it was also possible to define

indicators for each controller. In several cases, the participants were not able to select

the corresponding level for several RIFs. For example, in the case of the RIF ‘weather

conditions during the recovery process’ several controllers were so preoccupied with

the recovery process that they did not pay any attention to the weather conditions.

Therefore, they were unable to select the appropriate level for this RIF. The missing

responses were informed by those available for this RIF. In other words, the missing


responses were replaced with the answer ‘unchanged’ (corresponding to Level 2)

reported by the majority of controllers. This is also in line with the actual design of the

experiment, where similar weather conditions were presented to the controllers in the

pre- and post-failure period. A similar approach is applied for other missing answers.
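Replacing missing answers with the level reported by the majority of controllers amounts to a simple mode imputation, which can be sketched as follows. The function name and the example responses are hypothetical; `None` stands for a controller who could not select a level.

```python
from collections import Counter

def impute_with_mode(responses):
    """Replace missing responses (None) with the most frequent observed level."""
    observed = [r for r in responses if r is not None]
    if not observed:
        raise ValueError("no observed responses to derive a majority level from")
    majority_level, _ = Counter(observed).most_common(1)[0]
    return [majority_level if r is None else r for r in responses]

# Hypothetical levels chosen for the RIF 'weather conditions during the
# recovery process' (Level 2 corresponds to 'unchanged')
weather_levels = [2, 2, None, 1, 2, None, 3, 2]
print(impute_with_mode(weather_levels))
```

Mode imputation keeps the per-controller indicator computable, at the cost of slightly understating the variability of the affected RIF.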

Figure 10-6 shows the distribution of recovery contexts for 30 controllers. All values of

the Ic are positive and range between 0 and 0.1. This reflects an average or tolerable

environment (values of Ic are close to 0) that has a potential for improvement to

facilitate better recovery from equipment failure.

Figure 10-6 Distribution of the recovery context indicator of 30 controllers

After the assessment of recovery contexts surrounding each controller, the next section

reviews the potential solutions to enhance the recovery context (and thus controller

recovery) using the methodology developed in Chapter 8. In other words, the next

section analyses the sensitivity of the Ic to changes in RIFs.

10.3.1.5 Optimal solutions

In searching for the areas for potential enhancement to improve the controller’s

recovery process, it is necessary to focus on RIFs which may be affected at the level of

the ATC Centre. Table 10-5 presents the nine RIFs that could be enhanced, based on

the responses of the controllers who participated in the experiment and the

characteristics of the ATC Centre investigated.


Table 10-5 A review of RIFs with the potential for recovery enhancement

RIFs                                    Potential for improvement

Internal RIFs
  Training for recovery                 √
  Previous experience                   -
  Experience with system performance    -
  Personal factors                      √
  Communication for recovery            √

Equipment failure related RIFs
  Complexity of failure type            -
  Time course of failure development    -
  Number of workstations affected       -
  Time necessary to recover             √
  Existence of recovery procedure       √
  Duration of failure                   -

External RIFs
  Adequacy of HMI                       √
  Ambiguity of information              -
  Adequacy of organisation              √

Airspace related RIFs
  Traffic complexity                    -
  Airspace characteristics              √
  Weather conditions                    -
  Task complexity                       √

It is important to note that the remaining RIFs are not taken into account for several

reasons. Firstly, in the particular experiment, a number of RIFs attained their most

favourable levels. In such cases, the majority of controllers expressed satisfaction with

the ATC system and expressed no desire for improvement of the particular RIFs.

Furthermore, several RIFs were controlled in the experiment and as such cannot be

changed. These are: complexity of failure type, time course of failure development,

number of workstations affected, and duration of failure. Finally, certain RIFs simply

cannot be changed, such as weather and experience with a particular type of

equipment failure, whilst traffic complexity cannot be influenced at the level of the ATC

Centre. This resulted in a total of nine RIFs that have the potential to enhance the

recovery context and thus controller recovery performance (Table 10-5). The next

section illustrates how the improvement of one RIF (‘existence of the recovery

procedure’) could influence the recovery context.

10.3.1.5.1 Impact of enhancing ‘recovery procedure’ on recovery context

As the participating ATC Centre does not have a recovery procedure for FDPS failure

in place, this factor is chosen as the most practical and effective way of supporting


controllers and enhancing their recovery performance5. Assuming that the

management at the ATC Centre implements recovery procedures for FDPS failure, the

‘existence of recovery procedure’ RIF would be enhanced from Level 3 to Level 1 and

thus defined as ‘suitable to the situation in question’ (the probability of Level 1 equals

1.00; Table 10-6). This approach also assumes that all other RIFs remain unchanged

and that any potential impact of this change on other RIFs will be reflected through

identified RIF interactions.

The resulting recovery context would take the mean value of 0.091 (SD=0.0398; Table

10-6). The difference in the distribution of the Ic with and without change in the

recovery procedures has been tested using the non-parametric Mann-Whitney test

(presented in Chapter 6, section 6.7.4). Overall, the baseline recovery context differs

significantly from the recovery context which incorporated the proposed enhancement.

This means that the design of an appropriate recovery procedure significantly

enhances the recovery context and thus creates a better environment for controller

recovery.
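
The comparison described above can be sketched in code. The following is a minimal illustration of a two-sided Mann-Whitney test, assuming midranks for ties and the tie-corrected normal approximation without continuity correction; the thesis does not state which variant was used. The function name `mann_whitney_u` and the two samples are hypothetical and are not the experimental data.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Return (U, z, p) for two independent samples.

    Ties receive midranks; z uses the tie-corrected normal approximation
    (no continuity correction) and p is the two-sided p-value.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Pool the observations, remembering which sample each one came from.
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    n = n_a + n_b

    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Observations i..j-1 are tied and share the average of ranks i+1..j.
        midrank = (i + 1 + j) / 2.0
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        ties = j - i
        tie_term += ties ** 3 - ties
        i = j

    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    u = min(u_a, n_a * n_b - u_a)
    mean_u = n_a * n_b / 2.0
    var_u = n_a * n_b / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, z, p


# Made-up Ic samples, for illustration only (not the thesis data).
baseline = [0.01, 0.02, 0.03, 0.03, 0.05]
enhanced = [0.06, 0.08, 0.09, 0.10, 0.12]
u, z, p = mann_whitney_u(baseline, enhanced)
print(f"U={u:.1f}, z={z:.2f}, p={p:.4f}")
```

A small p-value in such a comparison would suggest that the two Ic distributions differ. The normal approximation is only a rough guide for samples as small as those in this illustration.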

Table 10-6 A review of the proposed recovery solutions

Potential RIF for change | Initial level | Ic (M, SD, SE) | Level after iteration | Ic (M, SD, SE) | Statistical significance with 95% confidence interval
Existence of recovery procedure | 0 | M=0.029, SD=0.036 | 1 | M=0.091, SD=0.039 | p<0.001, Sig (U=3E08, z=-196.2)

It has to be noted that the proposed change in the recovery procedure represents only

one possible form of recovery context enhancement. In reality, one ATC Centre may

undertake several other solutions to enhance controller recovery. Furthermore, the

proposed change assumes the definition of the recovery procedure for a particular

equipment failure. Therefore, the calculated recovery context indicator is valid for this

failure type only and it would have to be recalculated for other failure types.

This approach may be used to rate the significance of each proposed change and

compare it with its related cost. However, the evaluation of the related costs, as

opposed to the benefit, is not so straightforward and would necessitate an input from

5 The only available procedures in this ATC Centre are those defined by ICAO. As previously

discussed in Chapter 5, ICAO does not define recovery practice for the FDPS failure.



the specific ATC Centre. Therefore, another approach presented in Chapter 8 may be

utilised to ‘rate’ the benefit of implemented changes by the calculation of the ‘recovery

context efficiency’. The ratio between the value of the current recovery context (mean

value of 0.04; Figure 10-5) and the value of the most positive recovery context feasible

in the particular ATC Centre (i.e. Ic=0.44) indicates that a roughly tenfold improvement is

needed to achieve the most positive value of Ic.
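
The arithmetic behind this statement can be checked with a short sketch. It assumes that the 'recovery context efficiency' is the simple ratio of the current Ic to the most positive feasible Ic, as described above; the function name is hypothetical and the full definition is given in Chapter 8.

```python
def recovery_context_efficiency(current_ic, best_ic):
    """Ratio of the current recovery context indicator to the best feasible one."""
    if best_ic <= 0:
        raise ValueError("best_ic must be positive")
    return current_ic / best_ic


# Values quoted in the text: current mean Ic of 0.04, best feasible Ic of 0.44.
efficiency = recovery_context_efficiency(0.04, 0.44)
improvement_factor = 1 / efficiency
print(f"efficiency = {efficiency:.3f}, improvement factor = {improvement_factor:.1f}")
```

With these values the efficiency is about 0.09, so the factor needed to reach the best feasible context is about 11, consistent with the order of magnitude stated above.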

The next section analyses the recovery steps taken by the controllers and their overall

recovery effectiveness.

10.3.2 Required recovery steps

The recovery performance of each participant has been compared to the pre-

determined set of required recovery steps. Figure 10-7 presents the ratio of recovery

steps performed by each participant to the total number of steps, whilst Figure 10-8

presents the distribution of recovery steps carried out. Only three out of 17 steps were

performed by all participating controllers. These are detection of the problem, location

of traffic, and identification of failure type6.

[Figure 10-7: chart of the percentage of recovery steps performed (y-axis, 0 to 100) for each participant (x-axis, participants 1 to 30); series: steps performed, steps not performed]

Figure 10-7 Recovery steps performed by each participant

6 Note that if a controller did not seek failure-related information from the coordinator, the

coordinator was advised to inform the controller but only after the controller detected the failure. As a result, the occurrence of this step is inevitable.



[Figure 10-8: chart of the number of participants (y-axis, 0 to 30) performing each required recovery step (x-axis, S1 to S17)]

Figure 10-8 Distribution of required recovery steps (S1 to S17)

Further data analysis shows that on average each controller performed 74.2 percent of

the required recovery steps, ranging from as low as 29 percent to 100 percent. The

most neglected steps were the re-identification of all traffic (S14) and confirmation of

Mode C (i.e. confirmation of the accuracy of the post restoration FDPS data – S15).
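
The percentage reported for each controller follows directly from the checklist of 17 required steps. The sketch below assumes the steps are labelled S1 to S17 as in Figure 10-8; the function name and the example checklist are hypothetical.

```python
REQUIRED_STEPS = [f"S{i}" for i in range(1, 18)]  # S1 .. S17


def percent_performed(performed_steps):
    """Percentage of the 17 required recovery steps that were performed."""
    done = set(performed_steps) & set(REQUIRED_STEPS)  # ignore unknown labels
    return 100.0 * len(done) / len(REQUIRED_STEPS)


# Hypothetical checklist: a controller who performed five of the 17 steps.
example = ["S1", "S2", "S3", "S4", "S5"]
print(f"{percent_performed(example):.1f} percent")
```

For instance, five of 17 steps corresponds to roughly 29 percent, and all 17 steps to 100 percent.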

The post-restoration recovery steps of re-identifying traffic and validating Mode C are

important as these steps are considered best practice to ensure system safety in the

aftermath of an FDPS failure. The re-identification process is necessary for two

reasons. Firstly, the identification of traffic is lost whilst aircraft occupy a holding

pattern. Separation in a holding pattern is purely procedural and radar separation does

not apply. Secondly, there is potential for label swapping and garbling of radar

signals when aircraft are in close lateral proximity (e.g. in a holding pattern).

Further investigation of the percentage of the steps performed in three sessions

reveals a significant difference between the first and the third session. The percentage

of the steps carried out in the first session is significantly lower than in the third

session. The relevant statistics are presented in Table 10-7. The percentage of the

performed recovery steps in the first experimental session is on average 64 percent,

increasing in the second experimental session to 77 percent, reaching 82 percent in

the third experimental session (Table 10-7).



Table 10-7 Percentage of performed recovery steps in three experimental sessions

Session | Statistics | Paired sessions | Non-parametric Mann-Whitney test results
1 | M=63.98, SD=21.69 | 1 and 2 | p>0.05
2 | M=77.06, SD=17.64 | 1 and 3 | p=0.044, Sig (U=23.5, z=-2.0)
3 | M=81.77, SD=12.84 | 2 and 3 | p>0.05

After the last experimental session, it was suspected that certain changes had been

implemented in the training of controllers in the participating ATC Centre. The

debriefing session with controllers participating in the third experimental session and

the input from management revealed the incorporation of a compulsory emergency

training module within every rating conversion and continuation training course. This

change was first incorporated in the SID/STAR training that started in May 2006. As

a result, several controllers participating in the third experimental session (taking place

in June 2006) benefited from this change. It seems that this change in training

syllabus led to the increased number of recovery steps performed and the significant

difference observed when compared to the first experimental session.

Statistical tests performed to determine the relationship between the percentage of

recovery steps performed and 18 RIFs showed that only RIF2 (‘previous experience

with equipment failures’) has a statistically significant correlation. More precisely, the

negative correlation identified (r=-0.31) indicates that controllers who have experienced

equipment failures tend to perform more of the required recovery steps compared to

those who have not experienced failure. In other words, experience with equipment

failures enhances the controllers’ ability to recover. This finding should be transferred

into the training syllabus of every ATC Centre.

10.3.3 Recovery effectiveness

As explained in the previous Chapter, this variable is based on data and information

from three different sources, where each controller is categorised as follows: very good

(VG), good (G), adequate (A), partially adequate (PA), and inadequate (I). The

recovery performance of 43 percent of controllers is rated as partially adequate or

totally inadequate (Figure 10-9). These controllers did not assure ATC system

protection from possible further equipment degradation and did not employ timely and

accurate strip marking and strip board management. Therefore, they had little or no

means of supporting their mental picture of traffic and airspace. The post-restoration



steps were performed only to some basic extent without any proper check of the new

data accuracy. In addition, such a high percentage of inadequate performance

indicates that there is room for improvement throughout the ATC Centre participating in

this experimental investigation. The management of the ATC Centre should implement

solutions to assure a more efficient handling of unusual/emergency situations. Such

solutions could include emergency training on equipment failures, design of recovery

procedures, and regular briefings.

Figure 10-9 Distribution of recovery effectiveness per category (presented via frequencies and relative percentages)

Comparison of the recovery effectiveness for the three experimental sessions does not

reveal any significant differences (using the non-parametric Mann-Whitney test). In

spite of the implemented change in the participating ATC Centre (i.e. compulsory

emergency training module within the SID/STAR conversion training) and the increase

in the number of recovery steps performed, the effectiveness of the recovery

performance did not differ from one session to the other. This finding confirms that the

rating of recovery effectiveness does not depend on a simple count of recovery steps

performed. Rather, the rating indicates the overall objective achieved with the

execution of those steps, without accounting for the time frame (recovery duration)

within which the objective is achieved. The finding also further justifies the use of pooled

data from all three experimental sessions. The combined effect of recovery effectiveness

and recovery duration is assessed in section 10.3.5.



10.3.4 Recovery duration

The recovery duration is the time measured from the controller’s first action to the end

of the recovery process. During the experiment the first action was identified by the

observation and video recording of each controller’s performance, further validated with

the controller (during the post-experiment debriefing session) and the SME. For

example, the time of the first action was the moment when a controller initiated a

search for the uncorrelated track(s), contacted Area Control Centre (ACC) to check on

the uncorrelated track(s) or contacted aircraft to ask for a transponder check (using the

phraseology “squawk ident”). The end of the recovery process in this particular

experimental design was influenced by the restoration of the failed system and the

performance of the necessary post-restoration steps.

In general, the recovery duration ranged between 12:08 and 15:49 minutes, with an

average duration of 14:38 minutes (SD=0:55). The distribution of the recovery duration

of all 30 controllers across four duration categories is presented in Figure 10-10. These

categories are: 12-13, 13-14, 14-15, and 15-16 minutes. Figure 10-10 shows that 50

percent of controllers initiated the first recovery action within the first minute of the

failure occurrence (and thus their recovery duration lasted between 15 and 16

minutes). The shortest recovery duration is captured in the recovery performance of

two controllers (6.7 percent; Figure 10-10). These two controllers, although initiating

recovery later than the others, implemented an excellent recovery strategy. This finding

highlights that neither recovery duration nor recovery effectiveness alone is an

appropriate indicator of the overall recovery outcome. To enable a safety assessment

of the recovery performance it is necessary to account for both, as presented in section

10.3.5.
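
The assignment of a recovery duration to one of the four categories can be sketched as follows. The sketch assumes durations written as minutes:seconds and assumes that a boundary value such as 13:00 falls in the upper category (13-14); the thesis does not state how boundary values were treated. Function names are hypothetical.

```python
def parse_mmss(text):
    """Convert a 'minutes:seconds' string such as '14:38' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def duration_category(text):
    """Return the one-minute category ('12-13' .. '15-16') for a duration."""
    whole_minutes = parse_mmss(text) // 60
    if not 12 <= whole_minutes < 16:
        raise ValueError(f"duration {text} lies outside the 12 to 16 minute range")
    return f"{whole_minutes}-{whole_minutes + 1}"


# The shortest, mean, and longest durations quoted in the text.
for value in ("12:08", "14:38", "15:49"):
    print(value, "->", duration_category(value))
```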



Figure 10-10 Distribution of recovery duration

Comparison of the recovery duration for the three experimental sessions revealed

significant differences. More precisely, the recovery duration in the third experimental

session is significantly longer than in the first two sessions (Table 10-8). This is a result

of the controllers from the third session reacting to the identified failure more promptly

compared to the controllers from the previous two sessions. This may be the result of

the change in the training implemented by the management in the participating ATC

Centre prior to the third session. However, it has to be noted that more prompt reaction

to the identified failure (i.e. longer recovery duration) does not necessarily entail an

effective recovery.

Table 10-8 Comparison of recovery durations between three experimental sessions

Session | Statistics | Paired sessions | Non-parametric Mann-Whitney test results
1 | M=14:15, SD=1:02 | 1 and 2 | p>0.05
2 | M=14:25, SD=0:58 | 1 and 3 | p=0.031, Sig (U=21.5, z=-2.2)
3 | M=15:14, SD=0:18 | 2 and 3 | p=0.014, Sig (U=17.5, z=-2.5)

Non-parametric Kendall’s tau tests performed between recovery duration and various

RIFs reveal four statistically significant correlations. These are presented in Table 10-9

while the details of this test are discussed in Chapter 6. Firstly, the analysis shows that



the recovery duration tends to be longer7 if the last emergency training had a module

on equipment failures. This finding indicates the benefit that emergency training has on

recovery duration (as it prepares controllers to react rapidly to an emergency situation).

Secondly, a similar effect on recovery duration is seen with enhanced communication

for recovery. In other words, if the controllers initiate recovery sooner, they have more

time to adequately communicate the problem to team members or a supervisor.

Thirdly, the existence of adequate recovery procedures promotes prompt recovery

action. This is in line with the finding of the first test. Finally, recovery duration

increases with a decrease in traffic complexity. This is expected as the less demanding

traffic situation allows more prompt action and initiation of the first recovery action

sooner rather than later.
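
A rank correlation of the kind discussed above can be sketched in a few lines. The sketch computes Kendall's tau-b, the tie-adjusted variant, which is an assumption, as the thesis does not state which variant was applied. The function name and the data are hypothetical and serve only to show the direction of the coefficient.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: excluded from all counts
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denominator = math.sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    if denominator == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denominator


# Made-up data: recovery duration in seconds and a 1/0 flag for whether the
# last emergency training included a module on equipment failures.
durations = [905, 870, 940, 760, 810, 925]
had_module = [1, 0, 1, 0, 0, 1]
print(f"tau-b = {kendall_tau_b(durations, had_module):.2f}")
```

The sign of the coefficient depends on how each RIF level is coded, so a negative value may still correspond to a favourable relationship.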

Table 10-9 Statistical tests and results

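Variable 1 | Variable 2 | Test | Statistical significance at 95% confidence level
Recovery duration | Last emergency training (module on equipment failure) | Non-parametric correlation (Kendall’s tau) | p=0.018 (r=-0.39)
Recovery duration | Communication for recovery | Non-parametric correlation (Kendall’s tau) | p=0.10 (r=-0.39)
Recovery duration | Existence of the recovery procedure (footnote 8) | Non-parametric correlation (Kendall’s tau) | p=0.15 (r=-0.41)
Recovery duration | Traffic complexity | Non-parametric correlation (Kendall’s tau) | p=0.004 (r=-0.46)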

After assessing both recovery effectiveness and recovery duration, it is realised that

independently they are not appropriate indicators of the recovery outcome, as

discussed in Chapter 5. Therefore, a safety assessment of the overall recovery

performance necessitates the use of both variables combined into the ‘outcome of the

recovery process’ presented in the following section.

10.3.5 Outcome of the recovery process

The outcome of the recovery process represents the final stage in technical and

controller recovery as previously discussed in section 5.3 of Chapter 5. Since no

technical recovery was taken into account in this experiment, the outcome of the

7 More prompt first recovery action by a controller is representative of the longer recovery duration.

8 There is no recovery procedure for the simulated equipment failure in the participating ATC Centre, but some controllers stated that they had experienced similar failures as part of their initial simulator training. Discussion with the subject matter expert revealed that this particular equipment failure is not simulated in any training syllabus.



recovery process focuses solely on the outcome of controller recovery. This is defined

as a combination of two recovery variables. Firstly, recovery effectiveness

accounts for recovery steps carried out by a controller and achievement of the three

key objectives (i.e. ATC system protection, maintenance of situational awareness, and

adequate post-restoration steps). Secondly, recovery duration accounts for the time

frame in which these steps were performed. In line with the discussion in Chapter 5,

the outcome of the recovery process accounts for successful and unsuccessful

recovery. An additional category for ‘tolerable’ recovery outcome is also defined in this

thesis (Table 10-10).

Table 10-10 The outcome of the recovery process matrix applicable to the experimental set up presented in this thesis (S stands for successful, T for tolerable, and U for unsuccessful recovery)

Recovery effectiveness | Recovery duration (minutes): 12-13 | 13-14 | 14-15 | 15-16
Very good | T | T | S | S
Good | T | T | T | S
Adequate | U | T | T | T
Partially adequate | U | U | T | T
Totally inadequate | U | U | U | T

The recovery outcome matrix highlights that successful recovery requires the initiation

of the recovery process within the first two minutes from the instant of the failure

occurrence and the performance of the majority of the recovery steps (assuring

achievement of all three objectives). An unsuccessful recovery is a result of a controller

failing to achieve two or more key objectives while initiating the recovery after more

than one minute from the instant of the failure occurrence. The delayed first recovery

action leaves the ATC system completely unprotected. Therefore, the temporal

requirements for the unsuccessful recovery account for three categories of the

recovery duration variable (Table 10-10). Everything outside the scope of the

successful and unsuccessful recovery is considered tolerable. The above discussions

are only applicable to this experimental time frame and setting, and are extracted

based on operational experience, with a further validation by the SME.
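
The matrix in Table 10-10 amounts to a lookup on two categorical variables. A minimal sketch of that lookup is given below; the cell values are those of Table 10-10, while the names `OUTCOME_MATRIX` and `recovery_outcome` are hypothetical.

```python
DURATION_BINS = ("12-13", "13-14", "14-15", "15-16")  # minutes

# Rows follow Table 10-10: S = successful, T = tolerable, U = unsuccessful.
OUTCOME_MATRIX = {
    "Very good":          ("T", "T", "S", "S"),
    "Good":               ("T", "T", "T", "S"),
    "Adequate":           ("U", "T", "T", "T"),
    "Partially adequate": ("U", "U", "T", "T"),
    "Totally inadequate": ("U", "U", "U", "T"),
}


def recovery_outcome(effectiveness, duration_bin):
    """Look up the recovery outcome for one controller."""
    column = DURATION_BINS.index(duration_bin)  # ValueError if bin is unknown
    return OUTCOME_MATRIX[effectiveness][column]  # KeyError if rating is unknown


print(recovery_outcome("Good", "15-16"))
print(recovery_outcome("Adequate", "12-13"))
```

As the matrix applies only to this experimental time frame and setting, the bins and cell values would need to be redefined for any other setting.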

Based on the presented categorisation, the outcome of the recovery process for

controllers who participated in the experiment is mostly tolerable (Figure 10-11). This

finding again confirms that there is room for improvement of the recovery performance

in the ATC Centre used in this experiment.



Figure 10-11 Distribution of the recovery outcome

After assessing all recovery variables, the next section identifies any relevant

interactions between them.

10.3.6 Interactions

This section investigates the level of interactions between the recovery variables using

statistical testing (previously discussed in Chapter 6). Table 10-11 presents the results.

Table 10-11 Statistical tests and results

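Variable 1 | Variable 2 | Test | Statistical significance at 95 percent confidence interval
Recovery context indicator | Recovery effectiveness | Non-parametric test (Kendall’s tau) | p=0.06, r=0.32 (footnote 9)
Recovery context indicator | Outcome of the recovery process | Non-parametric test (Kendall’s tau) | p=0.017, r=-0.36
Recovery effectiveness | Outcome of the recovery process | Non-parametric test (Kendall’s tau) | p=0.01, r=0.57
Recovery duration | Outcome of the recovery process | Non-parametric test (Kendall’s tau) | p>0.05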

Non-parametric Kendall’s tau statistical tests indicated three significant relationships

(Table 10-11). Firstly, a statistical test indicates a relationship between recovery

effectiveness and recovery context indicator at the 90 percent confidence level

(p=0.06, r=0.32). Furthermore, the Mann-Whitney non-parametric test shows the

9 Statistical significance at the 90 percent confidence interval



relationship between recovery context indicator for the combined category of ‘very

good’ and ‘good’ recovery effectiveness on one side and ‘partially adequate’ and ‘totally

inadequate’ on the other (at the 90 percent confidence interval, p=0.065). Secondly, a

statistical test indicates a significant relationship between the recovery context indicator

and the outcome of the recovery process at the 95 percent confidence level (p=0.017,

r=-0.36). In other words, the higher values of the recovery context indicator enhance

the outcome of the recovery process or the recovery success. Finally, a statistical test

indicates a significant relationship between recovery effectiveness and the outcome of

the recovery process. In other words, the greater the controller’s recovery effectiveness, the

more successful the overall recovery. All findings are in line with the operational

experience.

10.3.7 Other findings

In addition to the findings above, the following points are worthy of note. They are

presented firstly by considering the phases of recovery and the corresponding

influencing factors, and secondly by considering the behaviour and attitude of the

controllers, as the simulated failure was unexpected. Finally, additional findings related

to controller recovery that are relevant to the management of the particular ATC Centre and

the wider aviation community are presented.

10.3.7.1 The recovery phases

The following paragraphs provide a review of the three distinct recovery phases as

explained in Chapter 5, section 5.2. This review focuses on the factors that influenced

controller recovery performance in each phase.

10.3.7.1.1 Detection

In the simulated runs, detection, or recognition that there is something unusual in the

ATC system, was determined by several factors. The most prominent factor was the

pilot's first contact with ATC. There were two flights entering the approach sector

simultaneously following failure injection. Depending on the pseudo-pilots’ workload,

either of these aircraft could contact the controller first. At the moment of the first

contact the flights were still outside of the controller’s area of responsibility (some

40Nm away from the airport10) and controllers were sufficiently busy in the vicinity of

the airport providing approach control service. As a result, the aircraft were usually

10 Note that the display range in this experiment was set to 30Nm for each controller.



asked to standby for radar identification. In the case of late contact by the first

uncorrelated track (once the track is almost visible on the radar screen or at about

35Nm from the airport), controllers searched for the track and detection of the problem

was then immediate. The common factors that influenced the detection phase of the

recovery process in this experiment were determined based on observations, video

recordings, and debriefings. These are as follows:

• The first radio contact (RT) of uncorrelated track;

• Traffic complexity and related level of controller workload at the moment of contact;

• Display range (set at 30Nm for this experiment);

• Type of the equipment failure (uncorrelated tracks were immediately visible on the

screen once within radar range); and

• Complexity of failure type (affecting single or multiple equipment simultaneously).

It should be noted that the same set of factors also affected the instant of the first

recovery action. The reason is that detection is a prerequisite for the first recovery

action.

10.3.7.1.2 Diagnosis

In this experiment, after the detection of one uncorrelated track, the controller’s first

assumption was usually aircraft transponder failure. This prompted a request to the

pilot to squawk identification on the secondary transponder (i.e. to operate the

designated Mode A code on the primary/secondary transponder). When this check did

not produce a correlated track on the radar screen further checks were necessary. At

this stage, the second aircraft was usually well inside the radar display range, also in an

uncorrelated state. At this point, it became obvious to the controllers that they were

experiencing some form of equipment failure and they sought information from the ATC

Centre coordinator as to the nature of the failure. The possible options were failure of

secondary surveillance radar or FDPS failure. SSR failure was discounted as soon as

the mix of correlated and uncorrelated tracks was visible. The final option was FDPS.

The coordinator was instructed to announce that it was FDPS failure affecting the

entire ATC Centre. Moreover, he also emphasised that flight plan tracks would remain

correlated only for tracks already displayed, while all other tracks entering the system

would appear uncorrelated. The common factors that influenced the diagnosis stage of

the recovery process in this experiment were determined based on observations, video

recordings, and debriefings. These are as follows:

• The number of uncorrelated tracks observed on the radar display;

• Input by the coordinator;

• Type of equipment failure; and

• Complexity of failure type.

10.3.7.1.3 Correction

In the exercised traffic scenario, the correction phase consisted of the identification of

all traffic using an appropriate primary radar technique. There are a number of

available techniques to identify traffic. Those chosen by the controllers in this

experiment were confirmation of bearing/distance of the aircraft from a fix and the turn

method (turning a single aircraft by 30 degrees or more to ascertain positive radar

identification). Operationally, the bearing/range technique is considered to be more

effective and expeditious, as it avoids misidentification due to simultaneous turning of

more than one aircraft. The next step in this process would be to inform all traffic of the

exact nature of the equipment failure and to advise them of possible consequences

(i.e. restrictions and delays). This would be followed by restricting any sport/training or

non-commercial aircraft, refusing permission for departures, and utilising the

holding pattern for all arrivals. If the failure was persistent (in this experiment it lasted

15 minutes), the controllers had to think of the steps to assure system safety in the

case of further deterioration of the equipment reliability. Thus, they had to provide

vertical separation and preserve the highest level of situational awareness. This should

be achieved by maintaining accurate and timely strip marking and strip board

management11. The common factors that influenced the correction stage of the

recovery process were determined based on observations, video recordings, and

debriefings. These are as follows:

• Traffic complexity;

• Existence and familiarity with the recovery procedure(s);

• Duration of failure;

• Type of equipment failure; and

• Complexity of failure type.

Figure 10-12 links the key characteristics of each recovery phase in this particular

experiment with the recovery steps relevant for each phase.

11 The debriefing sessions investigated the overall quality of strip management and annotation without going into a more detailed analysis. In future, the structure of the debriefing session may place more emphasis on this segment of the recovery process.



Figure 10-12 Recovery phases, their corresponding influencing factors and required recovery steps

10.3.7.2 Observed behaviour and attitude

As discussed in Chapter 9, all the observations of the controllers’ attitude and

behaviour were captured by the assistant. A check-list using the SHAPE’s list of

attitudes was used as an initial tool and guidance to the assistant in performing this

task (see EUROCONTROL, 2004f). In addition, some of the observations were

captured during the debriefing sessions.

In general, the observations in the first two experimental sessions show a difference in

overt behaviour in the pre- and post-failure segment of the experimental investigation.

In line with the results obtained with other recovery variables, the analysis of the

relevant data on controllers participating in the third session did not reveal significant

changes in overt behaviour in the pre- and post-failure segment of the experiment.

Furthermore, the findings from the first two sessions are in line with the previous

findings on the consequences of stress on individual controllers (Costa, 1995). Whilst

for some controllers the overall posture remained the same throughout the exercise,



others displayed the complete opposite. The deviations from the pre-failure behaviour

involved the following:

• increased movement (i.e. overall posture, hands, feet, or head);

• forceful displacement of the strip holders;

• deviations from standard RT phraseology;

• hesitation in RT communication; and

• change in pitch or tone of voice.

The subject matter expert involved confirmed that most of these behavioural gestures

depict a typical reaction to a reduced mental picture of either the traffic or overall

situational awareness. Even during the debriefing stage of the experiment, the change

in the controllers’ behaviour was noticeable for the first two experimental sessions.

Examples include shaky voice, overall unease, high alertness, and seriousness. The

controllers who performed the recovery process at either tolerable or good levels were

noticeably more relaxed and talkative. On the other hand, the controllers who

performed at either partially adequate or inadequate levels were without exception

more nervous and reluctant to answer questions in detail, and carry out an objective

review of their own performance. The overall conclusion is that the equipment failure

was an unexpected event and contributed to a significant increase in the controller’s

workload (as reported subjectively by the participating controllers).

10.3.7.3 Additional findings

It is important to present all acquired findings as they represent important issues for the

management of the participating ATC Centre as well as the wider aviation community.

These are presented in the following paragraphs.

Although 73 percent of the controllers reported that their training was suitable to the

equipment (i.e. FDPS) failure and traffic scenario in question, analysis of data collected

in the experiment showed that 43 percent (of the 73 percent) received their last

emergency training more than a year prior to the experiment12. Of the controllers

who were able to recall, 50 percent stated that the emergency training session they

participated in had a module on equipment failures, predominantly on radar failures.

However, it was also noted that 40 percent of the controllers did not have any type of

equipment failure in their last emergency training. As a result, 93 percent of controllers

12 Note that 27 percent of controllers had their last emergency training in the month prior to this experiment, as a part of the approach rating course.



who participated in the experiment reported they would like to have more frequent

training for unusual situations. The most desired frequency of emergency training

sessions was every six months. This is in line with the findings obtained in the

questionnaire survey (Chapter 6) where 45 percent of controllers believe that recurrent

training once a year is not enough to develop and maintain the level of proficiency

required for recovery from equipment failures.

Interesting results were obtained from the question on the existence of a recovery

procedure for the simulated FDPS failure. Although the procedure for this kind of failure

does not exist in the Manual of Air Traffic Services (MATS), 20 percent of controllers

believed that this particular procedure does exist. Some of the controllers, who had

participated in the approach control course, quoted their training manual as the

reference for this procedure. However, no evidence was found to support their

statement. The best explanation for this is that these controllers identified Secondary

Surveillance Radar (SSR) failure with FDPS failure and relied on their recent radar

fallback training, without fully understanding what the implications of the loss of FDPS

are. The outcome of FDPS failure is significantly different from simple SSR failure, as it

represents a more serious failure that requires immediate attention from the controllers

with the required skills.

On the issue of Human Machine Interface (HMI) and operational support (e.g. auxiliary

display, communication panel) 46.7 percent of controllers found the Beginning to End

Skills Trainer (BEST) simulator platform suitable to the equipment failure and traffic

scenario in question, 36.7 percent found it tolerable, while ten percent found it

counterproductive. The remaining 6.7 percent of the controllers did not respond to this

question. However, most of the controllers stated that the BEST platform’s HMI is not as

good as the HMI used in the operational centre. There are two reasons for this. Firstly,

meteorological data needs better positioning (i.e. closer to the screen) to avoid head turn

and change of visual field. Secondly, there is no alert or warning that a failure has occurred

(e.g. a colour change to yellow or red in the ‘general information window’).

Several organisational issues were raised during the debrief sessions. The most

frequent issues raised were that controllers:

• felt that supervisors should receive more dedicated training in the handling of

unusual occurrences and system failures. Their role in coordinating recovery

actions should be more proactive. In addition, it was highlighted that coordination


with technical services and adjacent ATC Centres should be the primary

responsibility of the supervisor during a Centre crisis;

• felt that more emphasis could be placed on developing an understanding of the

separate roles of both controllers and engineers. This perceived lack of

understanding of each peer group’s function and tasks can create communication

difficulties in the operational environment;

• identified a need for an update of the MATS with regard to the on suite task

allocation between the executive and planning controller. Additionally, controllers

stated that the last three incidents involving a loss of standard separation involved

team related issues that contributed to the events. Therefore, it is necessary to

strengthen the relationship between executive and planning controllers and to

define their precise roles and responsibilities;

• stated that their roles as currently defined in MATS are ideal but in reality are

difficult to adhere to, especially in a busy operational environment. They further

stated that in the event of an unusual occurrence, there are no guidelines available

for the handling of such situations;

• stated that competency checking, conducted once per year for only one hour, is not

sufficient. They also stated that refresher training in unusual

occurrences is limited to once per year. Once again, this finding is in line with

the questionnaire survey results presented in Chapter 6.

In general, the participating controllers rated their own performance between efficient

and tolerable (47 percent rated their own performance as efficient and 50 percent as

tolerable). This is not in accordance with the overall assessment of their performance

(recovery effectiveness) where 43 percent of the controllers performed at the ‘partially

adequate’ and ‘inadequate’ levels. This should pose some concern especially

considering that 46.7 percent of controllers stated that their performance in this study

was no different from any other day. In addition, 45 percent of them marked their

performance as highly representative of their overall ability to recover from an

equipment failure in ATC. Finally, 70 percent of controllers stated that the task they

experienced in the experiment was highly realistic.

Furthermore, 33 percent of the controllers stated that they were not aware of the

complete impacts/implications of a particular failure or equipment failures in general. As

a result, 87 percent of the controllers stated that they would like to have some form of

aide-memoire available at each CWP to assist them in recognising the effects of a

particular equipment failure and the steps to be taken to recover. As a consequence, this


thesis proposes a framework for the establishment of an aide-memoire (in Appendix

III). A summary of all additional findings is presented in Table 10-12.

Table 10-12 Summary of additional findings

| Variable | Finding | Comment |
|---|---|---|
| Training | 73 percent reported that their training was suitable | The majority of these controllers had their last training on unusual situations more than a year ago. Only half of the respondents had an equipment failure. |
| Training | 93 percent of controllers would like more frequent training for unusual situations | |
| Trust in ATC technology | 93 percent of controllers have an objective attitude toward ATC equipment | |
| Recovery procedure | 20 percent of controllers believe that the procedure for FDPS failure exists | The procedure does not exist in the ATC Centre |
| HMI | 46.7 percent of controllers found the BEST platform suitable to their needs and only 10 percent found it counter-productive | Negative comments are mostly related to the differences between the BEST platform and the system used in the operations room |
| Overall recovery performance | 47 percent of controllers rated their performance as efficient; 50 percent rated it as tolerable | Not in accordance with their overall performance. 43 percent of controllers were rated partially adequate or inadequate. |
| Awareness of the impact of a particular failure | 33 percent of controllers are not completely aware | |
| Availability of aide-memoire | 87 percent of controllers are in favour | A framework for an aide-memoire is provided in Appendix III |

10.4 Summary

This Chapter set out to achieve several objectives. Firstly, it set out to verify a

methodology for the quantitative assessment of the recovery context (defined in

Chapter 8) and its operational benefits. Secondly, it set out to verify a framework for an

in-depth analysis of controller recovery using the recovery variables previously identified in

Chapter 5. The final objective was to assess the outcome of the recovery process.

All these objectives have been achieved by the experiment, and several interesting

findings have been produced. These are as follows:

• The majority of controllers tend to omit some critical recovery steps related to the

post-restoration phase. These are re-identification of traffic and confirmation of

the accuracy of information provided by the restored equipment. The sampled

controllers seemed to rely on the information provided without questioning its

accuracy following the occurrence of a failure.


• Controllers with prior experience of equipment failures tend to carry out more

recovery steps compared to those without prior experience. In other words,

experience with any equipment failure tends to enhance the controllers’ ability to

deal with equipment failures. Moreover, this type of stress-exposure training

enhances the stress-coping skills of controllers and as such should be

incorporated into the training syllabus of every ATC Centre.

• A high percentage of inadequate recovery performance indicates that there is

room for improvement throughout the ATC Centre participating in the experiment.

Hence, the ATC Centre management should implement solutions to assure

efficient handling of unusual/emergency situations. Note, however, that the

management of the ATC Centre where the experiment took place implemented

an initial process to train controllers to deal with unusual/emergency situations.

This was in the form of a compulsory emergency training module within every

rating conversion and continuation training course.

• The first recovery action tends to occur more promptly if a controller has had

training for unusual/emergency situations.

• If the controllers initiate recovery sooner, they communicate better with team

members and the supervisor.

• The existence of adequate recovery procedures tends to promote prompt

recovery action.

• Recovery duration tends to increase with a decrease in traffic complexity. This is

expected as the less demanding traffic situation allows the controllers to initiate

recovery action sooner rather than later.

• The outcome of the recovery process variable has been defined as an overall

safety indicator of the recovery process. It represents a combination of the

recovery effectiveness and duration.

• The recovery context indicator represents a good indicator of both recovery

effectiveness and the outcome of the recovery process.

• Recovery duration itself is not a good indicator of the outcome of the recovery

process, whilst recovery effectiveness is.

• The framework for the analysis of controller recovery, proposed in this thesis and

verified in the operational environment, shows potential for an in-depth analysis

of controller recovery from equipment failures in ATC.

11 Conclusions

This Chapter presents the main findings of the research on controller recovery from

equipment failures in Air Traffic Control (ATC) and suggests avenues for future work.

The approach taken for the former is to address each of the research objectives

formulated in Chapter 1 (repeated below for ease of reference) and to present the

corresponding findings. The Chapter concludes with the identification of research

questions and ideas to be explored in future research.

11.1 Revisiting the research objectives

Chapter 1 defined a set of four research objectives for this thesis. These are to:

• Provide a systematic literature review to connect disparate but related topics of

ATC equipment failures and controller recovery, previously lacking in the area of

ATC;

• Identify potential equipment failure types and their characteristics;

• Identify contextual factors that affect controller recovery performance and derive a

methodology to quantitatively assess recovery context; and

• Propose a framework for the analysis of controller recovery. This framework should

be further verified with specific reference to a particular equipment failure type.

11.2 Conclusions

11.2.1 Literature review

The review of relevant literature aimed to connect ATC equipment failures with both

technical and air traffic controller recovery. With respect to the literature review, the

following conclusions are relevant:

1. The assessment of controller recovery from equipment failures in ATC has to

address technical and controller recovery together and not in isolation as has

been the case in the past. This holistic approach enables a complete

understanding of controller recovery and all of its influencing factors.


2. Because of the variety of equipment, components, and tools in both current and

future ATC system architectures, ATC equipment should be classified based on

the type of ATC functionality it supports. Such a functional classification is

flexible to changes in ATM/ATC and can capture both current and future

equipment failure types.

3. Recovery procedures, recovery training, and past experience with equipment

failures are the main drivers of controller recovery performance. However, the

provision of both recovery procedures and training is inconsistent across ATC

Centres.

4. The context in which controller performance takes place has an important role

in controller recovery.

11.2.2 Equipment failure types and their characteristics

Equipment failure characteristics were determined from past research and operational

experience through the analysis of operational failure reports and responses from a

questionnaire survey of air traffic controllers. With respect to equipment failure

characteristics, the following conclusions are relevant:

5. The key characteristics of ATC equipment failure are: ATC functionality

affected, complexity of failure type, time course of failure development, duration

of failure, potential causes of equipment failure, and the consequences of

equipment failure.

6. Information on equipment failure characteristics has been used to develop a

novel qualitative equipment failure impact assessment tool. This tool enables

the identification of equipment failures that are most challenging to ATC

operations.

7. Communication, surveillance, and data processing ATC functionalities are

affected most by equipment failures and have the most severe impact on ATC

operations. This finding has been verified by operational failure reports and the

results of the questionnaire survey.

8. According to operational failure reports, further verified with the results of the

questionnaire survey, equipment failures that have a major impact on ATC

operations mostly affect air-ground communication, radar surveillance

coverage, and the Flight Data Processing System (FDPS).

9. According to operational failure reports, the most frequent equipment failures

last up to 15 minutes. Furthermore, analysis of the reports has shown that the


longer the failure, the less severe it is. This finding is expected as more severe

failures are attended to immediately.

The conclusions listed above, resulting from the investigation of equipment failure

types and their characteristics in the operational ATC environment, have the potential

to impact policy formulation and the operational aspects of ATC/ATM. The thesis

findings have highlighted, for the first time, the ATC functionalities that are most

affected by equipment failures as well as those which have the most severe impact on

ATC operations. The uses of these findings are twofold. Firstly, they can be used to identify the equipment

failure types mandatory for recovery training/procedures designed for an ATC Centre.

Secondly, the qualitative equipment failure impact assessment tool can be used as

part of the incident investigation process as well as a design tool, supporting the design

of recovery training scenarios.

11.2.3 Controller recovery performance, recovery context, and influencing factors

The main findings related to controller recovery performance and the recovery context

are drawn from two sources of information. Firstly, the questionnaire survey results

provided an initial insight into controller recovery and relevant factors. Secondly, a

review of several Human Reliability Assessment (HRA) techniques identified a set of

relevant contextual factors, the so-called Recovery Influencing Factors (RIFs). With

respect to controller recovery and the overall recovery context, the following

conclusions are relevant:

10. This thesis presents, for the first time, a comprehensive investigation of the

factors that influence controller recovery. This has been done through a

rigorous process that started with relevant past research, a questionnaire

survey, targeted experiments, and statistical analyses to develop a functional

relationship between controller recovery and its influencing factors.

11. The questionnaire survey showed that the majority of controllers experience

equipment failures annually.

12. Improvement in ATC Centre management is required to facilitate effective

recovery. This can be achieved through, for example, an organised exchange of

experience within ATC Centres, not only with respect to equipment failures but

also with respect to all types of emergency/unusual situations. Statistical tests identified

that controllers treat the exchange of information about equipment

failures as a type of past experience.


13. The questionnaire survey showed that the vast majority of ATC Centres

surveyed have some form of recovery procedure. The most neglected

procedures are for ATC functionalities which are most challenging to controller

recovery (data processing, surveillance, and communication functionalities). In

addition, controllers highlighted the need for an abbreviated version of the

contingency manual which should be made available at each controller working

position (i.e. aide-memoire).

14. Recovery procedures should be up-to-date, complete, and follow a logical

sequence of steps that the controllers should perform. In addition, recovery

procedures need to be compatible with other procedures within the ATC Centre.

In short, procedures should be seen as guidance to the controller; they should

be adaptable to any given situation, and should take account of a variety of

contextual factors.

15. Half of the ATC Centres surveyed in the questionnaire survey have

programmes for training in recovery from equipment failures. However, this

recurrent training is usually provided once a year. The controllers believe that

the frequency of recurrent training is inadequate and are in favour of receiving

as much training as possible on emergency/unusual situations, including

equipment failures.

16. Recurrent training must be up-to-date and compatible with other training

programmes. Moreover, the recurrent training exercises should be varied and

realistic, covering both outages and less severe failures. The ATC Centre should

adopt a custom of periodically reverting to backup systems in order to maintain

controllers’ proficiency with their usage, perhaps during less busy traffic

periods.

17. Regular training on system functionalities, upgrades, and degradation modes

could be a useful method to ensure consistent knowledge and familiarity with

the ATC system architecture.

18. The majority of controllers surveyed confirmed the importance of context

surrounding an equipment failure occurrence. This confirmed the earlier finding

from the existing research literature.

19. The context surrounding controller recovery from equipment failure in ATC is

defined via 20 contextual factors, known as Recovery Influencing Factors

(RIFs). Each RIF can be further defined via its qualitative descriptor. This

establishes the relationship between each RIF and its influence on controller

performance.


20. An aggregated indicator of the entire recovery context has been proposed,

referred to as the recovery context indicator (Ic). This quantitative indicator of the

recovery context is sensitive to changes in the individual RIFs.
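To illustrate how an aggregated indicator can respond to a change in a single RIF, a minimal Python sketch is given here. It is not the formulation derived in Chapter 8: the weighted-mean form, the RIF names, the 0 to 1 scoring scale, and the weights are all assumptions made purely for illustration.

```python
from typing import Mapping


def context_indicator(levels: Mapping[str, float],
                      weights: Mapping[str, float]) -> float:
    """Hypothetical aggregate of RIF scores: a weighted mean.

    `levels` maps each RIF name to an assumed score in [0, 1]
    (1 = most favourable to recovery); `weights` maps the same
    names to assumed non-negative importance weights.
    """
    if set(levels) != set(weights):
        raise ValueError("levels and weights must cover the same RIFs")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(levels[name] * weights[name] for name in levels) / total_weight


# Illustrative values only; three made-up RIFs instead of the thesis's 20.
weights = {"training": 2.0, "procedures": 1.0, "traffic_complexity": 1.0}
baseline = {"training": 0.5, "procedures": 0.0, "traffic_complexity": 1.0}

# A candidate recommendation: introduce an adequate recovery procedure.
improved = dict(baseline, procedures=1.0)

ic_baseline = context_indicator(baseline, weights)
ic_improved = context_indicator(improved, weights)
shift = ic_improved - ic_baseline  # positive shift favours the recommendation
```

Under these assumptions, changing one RIF moves the indicator in proportion to that RIF's weight, which is the kind of sensitivity a comparison of alternative recommendations would rely on.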

This thesis presents, for the first time, a comprehensive set of the factors that influence

controller recovery (RIFs). These factors can be used as part of an incident

investigation process, enabling a detailed investigation of the impact of context on

controller recovery performance. The identification and assessment of RIFs can also

be used for the identification of recommendations on various aspects of ATC operation

and their refinement. However, the final decision on the optimal recommendation should

be based on the degree of positive shift in the value of the recovery context indicator

(as the quantitative indicator of the recovery context). Within the future ATM system,

this methodology could be easily modified to account for the shared responsibility of

separation of aircraft and collaborative decision-making between airborne and

ground-based ATM system components.

11.2.4 Framework for the analysis of controller recovery

The framework for the analysis of controller recovery proposed in this thesis was

verified in an experimental investigation with specific reference to a particular

equipment failure type (i.e. FDPS) and a particular ATC Centre. With respect to the

framework for the analysis of controller recovery, the following conclusions are

relevant:

21. Recovery variables relevant to controller recovery from equipment failures in

ATC are the recovery context, effectiveness, and duration. This set of recovery

variables showed a potential for the rigorous analysis of controller recovery.

22. The experiment showed that the controllers with previous experience of

equipment failures executed more required recovery steps. Overall, experience

with equipment failures enhances a controller’s ability to deal with any type of

equipment failure.

23. A further finding from the experiment is that recovery duration tends to be

longer, the closer the emergency training with a module on equipment failures

is to the occurrence of the actual failure.

24. Communication with team members or the supervisor is enhanced when

controllers initiate recovery action sooner (i.e. as close as possible to the instant

of the occurrence of the failure).


25. Furthermore, the experiment showed that the existence of recovery procedures

(or any type of reference material, such as training manuals) promotes prompt

recovery action.

26. The experiment also showed that recovery duration increases with a decrease

in traffic complexity.

27. The recovery context indicator represents a good indicator of both recovery

effectiveness and the outcome of the recovery process (represented as a

combination of the recovery effectiveness and duration).

28. The thesis has identified a statistically significant correlation between the recovery

context indicator and the outcome of the recovery process. Hence, the outcome

of the recovery process represents a good safety indicator of the overall

recovery process.
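As a rough illustration of how the strength of such a relationship can be checked, the sketch below computes a Pearson correlation coefficient in plain Python. The choice of Pearson's coefficient, the function name, and the five data points are assumptions for illustration only; the numbers are invented and are not the experimental results reported in Chapter 10.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Invented example: one context indicator value and one outcome score
# per controller (higher outcome score = better recovery outcome).
context_values = [0.2, 0.4, 0.5, 0.7, 0.9]
outcome_scores = [1.0, 2.0, 2.0, 3.0, 4.0]

r = pearson_r(context_values, outcome_scores)
```

A coefficient close to 1 for real data would support using the context indicator as a predictor of the recovery outcome; judging statistical significance would additionally require a test that accounts for the sample size.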

The relevance of recovery training (either as an alternative or an addition to past

experience) and recovery procedures has been confirmed by experiment. Recovery

training and awareness of recovery procedures lead to more prompt recovery action,

better awareness of required recovery steps, and enhanced team communication.

These findings should directly inform the required policy on training and procedures for

handling unusual/emergency situations, highlighting required content, frequency, and

format. Furthermore, the recovery variables identified (recovery context, effectiveness,

and duration) have the potential to facilitate a rigorous analysis of controller recovery

from equipment failures in ATC and thus can be used in incident investigation

processes. Finally, the recovery context indicator represents a good indicator of the

outcome of the recovery process (represented as a combination of the recovery

effectiveness and duration). As such, the overall framework for the analysis of

controller recovery based on identified recovery variables can be used to assess the

outcome of the recovery process in both current and future ATM environments.

11.3 Future work

The research presented in this thesis demonstrates the capability to assess ATC

equipment failures and subsequent controller recovery performance. However, these

findings also suggest a number of directions for further research. These include:

• It is hard to find safety-related research in the aviation industry which does not rely

upon some type of occurrence data. However, such research seldom questions

the reliability of the data available. To date, no measure of the

reliability of occurrence databases has been produced. Automatic tools exist in

certain countries, for example the Safety Monitoring Function (SMF), which


captures all loss-of-separation incidents in the controlled airspace of that country.

Data from such a tool may provide an indication of the reliability of the occurrence

data.

• Future research should investigate ways to overcome the logistical difficulties with

capturing operational data and corresponding qualitative and quantitative aspects of

validation (e.g. in terms of questionnaire survey sample, number and characteristics

of ATM specialists, and subject matter experts).

• The further development of the qualitative equipment failure impact assessment

tool (Chapter 4) would be required to enable assessment of the impact of several

independent failures on ATC operations and thus controller performance. The

output of this more advanced approach would be to indicate the most severe

independent multiple failures. However, to achieve this, the tool would have to be

adapted to a specific ATC Centre to integrate the complexity of its ATC architecture

and the flow of data between various ATC systems.

• The questionnaire survey used in any future research should apply rigorous design

methods to avoid ambiguities and facilitate interpretation or perception of key terms

(e.g. equipment failure).

• The relationship between a particular RIF level and its impact on controller

recovery (i.e. defined via qualitative descriptor in Chapter 7 and the correlation

coefficient in Chapter 8) could be defined as a function of RIF level. This approach

would be more sensitive to the changes resulting from the incorporation of RIF

interactions.

• It would be necessary to simulate the impact of ATC equipment failures in a future

gate-to-gate ATM system where the roles for planning and executive control will be

reorganised and distributed between controllers and pilots. Additionally, this future

environment will be characterised by dynamic real-time exchange and distribution

of flight-related information. Thus, the safety assessments would have to consider

the exchange and distribution of corrupted data and its impact on both air and

ground services.

• The thesis has identified a statistically significant correlation between the recovery

context indicator and the outcome of the recovery process. Future research should

transfer this finding into a model that could be used operationally in an ATC Centre.

11.4 Publications relating to this work

The following publications have been produced in support of the research on controller

recovery from equipment failures in ATC. The publications consist of journal


publications and published conference proceedings, each commented on the precise

contribution of listed co-authors.

11.4.1 Publication format: journal – accepted subject to revision

Subotic, B., Majumdar, A., and Ochieng, W.Y. (2007). Recovery from Equipment

Failures in Air Traffic Control (ATC): The findings from an international survey of

controllers. Accepted subject to revision to the International Journal of Engineering and

Operations: Air Traffic Control Quarterly. Air Traffic Control Association Institute, Inc.

11.4.2 Publication format: journal - published

Subotic, B., Ochieng, W.Y., and Straeter, O. (2007). Recovery from equipment failures

in ATC: An overview of contextual factors. The Reliability Engineering and System

Safety Journal, Vol 92 (7), pp. 858-870.

Subotic, B., Ochieng, W.Y., and Majumdar, A. (2005). Equipment Failures in Air Traffic

Control: Finding an Appropriate Safety Target. The Aeronautical Journal of the Royal

Aeronautical Society, Vol 109 (1096), pp. 277-284.

11.4.3 Publication format: conference proceedings - published

Subotic, B., Ochieng, W. and Straeter, O. (2006). Recovery from Equipment Failures in

Air Traffic Control: A Probabilistic Assessment of Context. Proceedings of the

Probabilistic Safety Assessment (PSAM 08) conference, May 14-19, 2006, New

Orleans, USA.

Subotic, B., and Ochieng, W.Y. (2005). Recovery from Equipment Failures in Air Traffic

Control. In Contemporary Ergonomics 2005 (Eds. P.D. Bust and P. T. McCabe). Taylor

& Francis. Presented at the Ergonomics Society Annual Conference, De Havilland

Campus, University of Hertfordshire, Hatfield.

12 List of References

10News (2006). Power Outage Momentarily Interrupts Air Traffic Control. From http://www.10news.com/news/8831526/detail.html

Air Transport Action Group (2005). The economic & social benefits of air transport. From http://www.atag.org/files/Soceconomic-124721A.pdf

Air Transport Association (2006). Cost of ATC Delays. From http://www.airlines.org/economics/specialtopics/ATC+Delay+Cost.htm

Airbus (2004). Global Market Forecast 2004-2023. From http://www.airbus.com/en/myairbus/global_market_forcast.html

Airways New Zealand (2006a). Manual of Air Traffic Services (amendment 113). Airways New Zealand.

Airways New Zealand (2006b). Domestic and International Aircraft Movements by Calendar Year. From http://www.airways.co.nz/documents/avimove_stats.pdf

Aviation International News (2001). Europeans embracing MLS with a vengeance. From http://www.ainonline.com/issues/04_01/Apr_2001_europeanmlspg75.html

Bainbridge, L. (1983). Ironies of Automation. Automatica, 19, 775-779. From http://www.bainbrdg.demon.co.uk/Papers/Ironies.html

Bainbridge, L. (1984). Diagnostic Skill in Process Operation. Department of Psychology, University College London. From http://www.bainbrdg.demon.co.uk/Papers/DiagnosticSkill.html

Baker, S., and Weston, I. (2001). Mayday, mayday, mayday. From http://www.isasi.org/working_groups/ats/atsmayday.pdf

Berenson, M.L., Levine, D.M., and Krehbiel, T.C. (2006). Basic Business Statistics: Concepts and Applications. Prentice Hall: Upper Saddle River, NJ.

Billings, C.E. (1996). Aviation Automation: The Search for a Human-Centred Approach. Hillsdale, N.J.: Lawrence Erlbaum Associates.

Boehm-Davis, D., Curry, R.E., Wiener, E.L., and Harrison, R.L. (1983). Human factors of flight-deck automation: Report on a NASA industry workshop. Ergonomics, 26, 953-961.

Boeing (2004). Statistical Summary of Commercial Jet Airplane Accidents: Worldwide Operations 1959 – 2003. From http://www.boeing.com/news/techissues/pdf/statsum.pdf

Bove, T. (2002). Development and Validation of a Human Error Management Taxonomy in Air Traffic Control. PhD dissertation. Risø National Laboratory, Roskilde. From http://www.risoe.dk/rispubl/SYS/syspdf/ris-r-1378.pdf


British Airways (2006). Flight Training Safety and Emergency Procedures (SEP) Training. From http://www.britishairwaysjobs.com/baweb1/?newms=info150

Brooker, P. (2004). Consistent and up-to-date aviation safety targets. Draft version. Cranfield University.

Brooker, P. (2006). Air Traffic Control Safety Indicators: What is Achievable? Eurocontrol: Safety R&D Seminar, 25-27 October 2006, Spain. From https://dspace.lib.cranfield.ac.uk/bitstream/1826/1372/1/Eurocontrol+2006+ATC-Brooker.pdf

Bureau of Transport and Regional Economics (2006). Aviation. Australian Government. From http://www.btre.gov.au/statistics/aviation.aspx

Bureau of Transportation Statistics (2004). Airline On-Time Statistics and Delay Causes. From http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp

Bureau of Transportation Statistics (2006). Dictionary. From http://www.bts.gov/dictionary/list.xml?letter=A

CASA (2006). ADS-B: Automatic Dependent Surveillance – Broadcast. Civil Aviation Safety Authority Australia. From http://casa.gov.au/pilots/download/ADS-B.pdf

Christensen, W.C., and Manuele, F.A. (1999). Safety through Design: Best Practices. National Safety Council Press.

Cox, K. (2005). Teamwork and Trust: A Pilot’s Perspective. From http://safecopter.arc.nasa.gov/Pages/Columns/SBrief/SafeBrf1Articles/6Teamwork.html

Damidau, A., Kirwan, B., and Scrivani, P. (2006). Safety Getting Real: Safety Insights from Real Time Simulations. Proceedings from the EUROCONTROL Safety R&D Seminar, Barcelona 25-27 October 2006, Spain.

Daniels, J.J., Regli, S.H., and Franke, J.L. (2002). Support for Intelligent Interruption and Augmented Context Recovery. Proceedings from 7th IEEE Human Factors Meeting. Scottsdale, Arizona.

Dekker, S., Fields, B., and Wright, P. (2004). Human Error Recontextualised. From http://www.cs.mdx.ac.uk/staffpages/bobf/papers/glasgow.pdf

Department of Defense (2001). Global Positioning System: Standard Positioning Service Performance Standard. Command, Control, Communication, and Intelligence. Washington DC.

Endsley, M. (1997). Situation Awareness, Automation & Free Flight. From http://atm-seminar-97.eurocontrol.fr/endsley.htm

Endsley, M. R., and Kaber, D. B. (1999). Level of automation effects on performance, situation awareness and workload in a dynamic control task. Ergonomics, 42(3), pp. 462-492.

Endsley, M., and Kiris, E. (1995). The out-of-the-loop performance problem and level of control in automation. Human Factors, 37(2), pp. 381-394.

EUROCONTROL (1997). EUROCONTROL Standard Document for Radar Surveillance in En-Route Airspace and Major Terminal Areas. From http://www.eurocontrol.int/surveillance/gallery/content/public/documents/SURVSTD.pdf

EUROCONTROL (1999). CD-ROM: An introduction to ATM. EUROCONTROL Institute of Air Navigation Services.


EUROCONTROL (2000a). Safety Minima Study: Review Of Existing Standards And Practices. From http://www.eurocontrol.int/src/gallery/content/public/documents/deliverables/srcdoc1ri.pdf

EUROCONTROL (2000b). Conflict Resolution Assistant Level 2 (CORA2): Controller Assessments (ASA.01.CORA.2.DEL02-b.RS).

EUROCONTROL (2000c). ESARR 2: Reporting and Assessment of Safety Occurrences in ATM. From http://www.atceuc.org/site/Eurocontrol/pdf02/esarr2%20v2.0%20en.pdf

EUROCONTROL (2001a). ECAC Safety Minima for ATM. EUROCONTROL Safety Regulation Commission.

EUROCONTROL (2001b). ESARR 4: Risk Assessment and Mitigation in ATM. EUROCONTROL Safety Regulation Commission. From http://www.eurocontrol.int/src/gallery/content/public/documents/deliverables/esarr4v1.pdf

EUROCONTROL (2001c). Safety assessment of the free route airspace concept: Feasibility phase. Working Draft 0.3. European Organisation for the Safety of Air Navigation, EUROCONTROL. From http://www.eurocontrol.int/airspace/gallery/content/public/documents/frap/safety_assessment_report_integrated

EUROCONTROL (2001d). European Manual of Personnel Licensing - Air Traffic Controllers: Guidance on Implementation. From http://www.eurocontrol.int/humanfactors/gallery/content/public/docs/DELIVERABLES/L2%20(HUM.ET1.ST08.10000-GUI-01)%20Released-withsig.pdf

EUROCONTROL (2001e). Harmonisation of European Incident Definitions Initiative for ATM – HEIDI Viewer Instructions for Use. Safety, Quality and Standardisation Unit (SQS).

EUROCONTROL (2001f). EUROCONTROL Airspace Strategy for the ECAC States. From http://www.eurocontrol.int/eatm/gallery/content/public/library/airspace.pdf

EUROCONTROL (2002b). Technical Review of Human Performance Models and Taxonomies of Human Error in ATM (HERA). From http://www.eurocontrol.int/humanfactors/gallery/content/public/docs/DELIVERABLES/HF26 (HRS-HSP-002-REP-01) Released.pdf

EUROCONTROL (2002c). Glossary of Terms and Definitions & List of Acronyms (SRC DOC 4). From http://www.eurocontrol.int/src/gallery/content/public/documents/deliverables/srcdoc4e2.pdf

EUROCONTROL (2002d). Short Report on Human Performance Models and Taxonomies of Human Error in ATM (HERA). From http://www.eurocontrol.int/humanfactors/gallery/content/public/docs/DELIVERABLES/HF27%20(HRS-HSP-002-REP-02)%20Released.pdf

EUROCONTROL (2003a). MADAP in a Nutshell. Maastricht Upper Area Control Centre, Netherlands.

EUROCONTROL (2003b). Summer: ATFM summary report. From http://www.cfmu.eurocontrol.int/ATFM/public/docs/publicreport_2003year.pdf



EUROCONTROL (2003c). EUROCONTROL ATM Strategy for the Years 2000+, Volume 1. From http://www.eurocontrol.int/eatm/gallery/content/public/library/ATM2000-EN-V1-2003.pdf

EUROCONTROL (2003d). HERA-JANUS training: Analysing Human Error in Incident Investigation. 18-20 November 2003. EUROCONTROL Institute of Air Navigation Services, Luxembourg.

EUROCONTROL (2003e). The Human Error in ATM Technique (HERA-JANUS). From http://www.eurocontrol.int/humanfactors/gallery/content/public/docs/DELIVERABLES/HF30 (HRS-HSP-002-REP-03) Released-withsig.pdf

EUROCONTROL (2003f). Guidelines for Controller Training in the Handling of Unusual/Emergency Situations. From http://www.eurocontrol.int/humanfactors/gallery/content/public/docs/DELIVERABLES/T11%20(Edition%202.0)%20HRS-TSP-004-GUI-05withsig.pdf

EUROCONTROL (2003g). Radio and Navigation Aids Course (IANS_ATC_RADNAV). EUROCONTROL Institute of Air Navigation Services, Luxembourg.

EUROCONTROL (2003h). Area Navigation Applications in Europe. From http://elearning.eurocontrol.int/ATMTraining/precourse/nav/rnav/index.html

EUROCONTROL (2003i). ESARR 6: Software in ATM Systems. Safety Regulatory Commission. From http://www.eurocontrol.int/src/gallery/content/public/documents/deliverables/esarr6_e10_ri.pdf

EUROCONTROL (2004a). Evaluating the True Cost to Airlines of One Minute of Airborne or Ground Delay. Prepared by the University of Westminster for Performance Review Unit. From www.eurocontrol.int/prc/gallery/content/public/Docs/cost_of_delay.pdf

EUROCONTROL (2004b). MANTAS Basic Operational Concept, Version: Draft 0.2. EUROCONTROL.

EUROCONTROL (2004c). CORA 2 Safety Analysis: Exploratory Preliminary System Safety Assessment (PSSA). European Air Traffic Management Programme.

EUROCONTROL (2004d). Review of Techniques to Support the EATMP Safety Assessment Methodology. From http://www.eurocontrol.int/eec/gallery/content/public/documents/EEC_notes/2004/EEC_note_2004_01_1.pdf

EUROCONTROL (2004e). Managing System Disturbances in ATM: Background and Contextual Framework. From http://www.eurocontrol.int/humanfactors/gallery/content/public/docs/DELIVERABLES/HF47%20(HRS-HSP-005-REP-06)%20Released-withsig.pdf

EUROCONTROL (2004f). The Impact of Automation on Future Controller Skill Requirements and a Framework for SHAPE (HRS/HSP-005-REP-04). Human Factors Management Business Division (DAS/HUM).

EUROCONTROL (2004g). Model Based Simulation of the Turkish En-Route Airspace (EEC Report No. 396). From http://www.ans.dhmi.gov.tr/TR/ATCTR/proje/fts.pdf

EUROCONTROL (2005). ATM Contribution to Aircraft Accidents/Incidents: Review and Analysis of Historical Data. From http://www.eurocontrol.int/src/gallery/content/public/documents/deliverables/srcdoc2_e40_ri_web.pdf



EUROCONTROL (2006a). Air Traffic Control (ATC). From http://www.eurocontrol.int/corporate/public/standard_page/cb_airtraffic_controller.html

EUROCONTROL (2006b). What is PRNAV? From http://www.ecacnav.com/content.asp?PageID=82

EUROCONTROL (2006c). Performance Review Report covering the calendar year 2005. Performance Review Commission.

EUROCONTROL (2006d). The impact of fragmentation in European ATM/CNS. Performance Review Commission. From http://www.eurocontrol.int/prc/gallery/content/public/Docs/fragmentation.pdf

EUROCONTROL (2007a). Safety Nets. From http://www.eurocontrol.int/safety-nets/public/subsite_homepage/homepage.html

EUROCONTROL (2007b). Single European Sky. From http://www.eurocontrol.int/ses/public/subsite_homepage/homepage.html

European Commission (2001). Meeting society’s needs and winning global leadership. Report of the group of personalities. From http://ec.europa.eu/research/growth/aeronautics2020/pdf/aeronautics2020_en.pdf

European Commission (2006a). GNSS Autonomous Navigation Algorithms Critical Study (D3.2.2.1). Draft report. Sixth Framework Programme (2002-2006).

European Commission (2006b). Critical Analysis of Space-Based Navigation Technologies Usable for Civil Aviation (D3.1P). Draft report. Sixth Framework Programme (2002-2006).

European Space Agency (2002). Space Product Assurance: Safety (ESA Q-40-B). Requirements & Standards Division. Noordwijk, The Netherlands.

Federal Aviation Administration (1995). Approach Station Keeping (Ask) Experiment Plan and Final Report (DOT/FAA/CT-TN95/58). Department of Transportation: Federal Aviation Administration. From http://www.tc.faa.gov/acb300/techreports/TN9558.pdf

Federal Aviation Administration (1997). Hardware Product Specification Document for the Voice Switching and Control System (VSCS) (DTFA01–92–D–00004). Department of Transportation: Federal Aviation Administration.

Federal Aviation Administration (1998). Voice Switching and Control System: Attachment J-3 - Product Specification (FAA-E-2731G). Department of Transportation: Federal Aviation Administration.

Federal Aviation Administration (2000). System Safety Handbook, Chapter 3. Department of Transportation: Federal Aviation Administration. From http://www.asy.faa.gov/RISK/SSHandbook/contents.htm.

Federal Aviation Administration (2003). The Human Factors Design Standard (HF-STD-001). Compact disk, William J. Hughes Technical Center, Atlantic City International Airport, NJ.

Federal Aviation Administration (2005). Air Transportation Operations Inspector's Handbook (Order 8400), Vol 1. Department of Transportation: Federal Aviation Administration. From http://www.faa.gov/library/manuals/examiners_inspectors/8400/



Feng, S., Ochieng, W., Walsh, D., and Ioannides, R. (2005). A Measurement Domain Receiver Autonomous Integrity Monitoring Algorithm. GPS Solutions. Springer Berlin/Heidelberg.

Frese, M. (1991). Error Management or Error Prevention: Two Strategies to Deal with Errors in Software Design. In H. J. Bullinger (Ed.) Human aspects in Computing: Design and Use of Interactive Systems and Work with Terminals. Amsterdam: Elsevier Science Publishers.

Frese, M., Brodbeck, F.C., Zapf, D., & Prumper, J. (1990). The Effects of Task Structure and Social Support on Users’ Errors and Error Handling. In D. Diaper et al. (Eds.) Human-Computer Interaction - INTERACT’90 (pp. 35-41). Amsterdam: Elsevier Science Publishers.

Fujita, Y., and Hollnagel, E. (2004). Failures without errors: quantification of context in HRA. Reliability Engineering and System Safety, 83, pp. 145-151.

Funk, K., Lyall, B., and Riley, V. (1996). Perceived Human Factors Problems of Flightdeck Automation: Phase 1 Final Report. Federal Aviation Administration Grant 93-G-039. From http://www.flightdeckautomation.com/phase1/phase1report.aspx

General Accounting Office (1982). Computer Outages at Terminal Facilities and Their Correlation to Near mid-air Collisions (AFMD-82-43). US GAO, Washington DC.

General Accounting Office (1991). Air Traffic Control: FAA Can Better Forecast and Prevent Equipment Failures. US GAO, Washington DC.

General Accounting Office (1996). Air Traffic Control: Good Progress on Interim Replacement for Outage-Plagued System, but Risks Can Be Further Reduced. US GAO, Washington DC.

General Accounting Office (1998). Air Traffic Control: Information Concerning Equipment Outages at Two Kansas City Area Facilities. US GAO, Washington DC.

Gordon, R., and Makings, N. (2003). Gate 2 Gate: Stakeholder Safety Survey. EUROCONTROL Experimental Centre, France.

Graham, G.M., Kinnersly, S., and Joyce, A. (2002). Safety Reporting and Aviation Target Levels of Safety. In C.W. Johnson (Ed.), Investigation and Reporting of Incidents and Accidents (IRIA 2002). Department of Computing Science, University of Glasgow, Scotland.

Hai, L. (2004). Civil Aviation Safety Outline (2001-2020). From http://www.seaskyad.com/ad@cca_english/content/content_0206_special_articles/article16.htm.

Hallbert, B.P., and Meyer, P. (1995). Summary of lessons learned at the OECD Halden reactor project for the evaluation of human-machine systems. Institutt for Energiteknikk, Halden, Norway.

Heinrich, H.W. (1941). Industrial Accident Prevention – A Scientific Approach. McGraw-Hill: New York and Wiley: London.

Hilburn, B. (2004). Cognitive Complexity in Air Traffic Control - A Literature Review. EUROCONTROL Experimental Centre, EEC Note 04/04.

Hilburn, B., and Flynn, M. (2001). Air Traffic Controller and Management Attitudes Toward Automation: An Empirical Investigation. 4th USA/EUROPE Air Traffic Management R&D Seminar, Santa Fe, USA.



Hollnagel, E. (1993). Human Reliability Analysis: Context and Control. Academic Press, London.

Hollnagel, E. (1998). Cognitive Reliability and Error Analysis Method (CREAM). Elsevier Science Ltd., London, UK.

IEEE (1998). IEEE Guide for Microwave Communications System Development: Design, Procurement, Construction, Maintenance, and Operation. IEEE-SA Standards Board. From http://ieeexplore.ieee.org/iel4/5643/15123/00690973.pdf?arnumber=690973

IFALPA (2005). Interpilot: 60th Annual Conference: Boeing 787 programme update. From http://216.239.59.104/search?q=cache:oJuuByAkeqEJ:www.ifalpa.org/Interpilot/2005/06inp01.pdf+Interpilot:+60th+Annual+Conference:+Boeing+787+programme+update&hl=en&ct=clnk&cd=1&gl=uk

IFATCA (2004). Produce Definition of Controller Tools (Agenda Item B.5.2). Proceedings from 43rd Annual Conference, Hong Kong, 22-26 March 2004.

IFATCA (2005). A Positive Step to Improve Aviation Safety. From http://www.ifatca.org/press/141105.pdf

International Civil Aviation Organization (1979). Annex 5: Units of Measurement to be Used in Air and Ground Operations. Montreal, Canada.

International Civil Aviation Organization (1985). Manual of Air Traffic Forecasting (Doc 8991-AT/722/2). Montreal, Canada.

International Civil Aviation Organization (1994). All-Weather Operations Panel. Fifteenth meeting. Montreal, Canada.

International Civil Aviation Organization (1995). Review of the General Concept of Separation panel (RGCSP). Working Group A: A Review of Work on Deriving a Target Level of Safety (TLS) for En-route Collision Risk. Montreal, Canada.

International Civil Aviation Organization (1997). Outlook for Air Transport to the Year 2005 (ICAO Circular 270-AT/111). Montreal, Canada.

International Civil Aviation Organization (1998). Human Factors Training Manual – Doc 9683 (First Edition). Montreal, Canada.

International Civil Aviation Organization (2001a). Air Traffic Management Doc 4444. Montreal, Canada.

International Civil Aviation Organization (2001b). Annex 6: Operation of Aircraft. Montreal, Canada.

International Civil Aviation Organization (2001c). Annex 11: Air Traffic Services. Montreal, Canada.

International Civil Aviation Organization (2001d). Annex 13: Aircraft Accident and Incident Investigation. Montreal, Canada.

International Civil Aviation Organization (2001e). Annex 1: Personnel Licensing. Montreal, Canada.

International Civil Aviation Organization (2003). Review the latest developments in the ATN Panel and the Aeronautical Mobile Communication Panel. From http://www.icao.int/icao/en/ro/apac/atn_2003/ip02.pdf

International Civil Aviation Organization (2005). Report of the Ninth Meeting of Communications, Navigation And Surveillance/Meteorology Sub-Group (CNS/MET/SG/9), Bangkok, Thailand, 11-15 July 2005. From http://www.icao.int/icao/en/ro/apac/2005/CNS_MET_SG9/CNSMET_SG9.pdf

International Civil Aviation Organization (2006a). Review Developments Relating to CNS/ATM Implementation: Review the Work by RNP Special Operational Requirements Study Group on the Implementation of RNP Operations. From http://www.icao.int/icao/en/ro/apac/2006/ATM_AIS_SAR_SG16/wp22.pdf

International Civil Aviation Organization (2006b). Contracting States. From http://www.icao.int/cgi/goto_m.pl?/cgi/statesDB4.pl?en

International Civil Aviation Organization (2007). CNS/ATM Systems. From http://www.icao.int/icao/en/ro/rio/execsum.pdf

Jeppesen (2001). Required Navigation Performance (RNP). Jeppesen Briefing Bulletin. From http://www.jeppesen.com/download/briefbull/den01-j.pdf

Johnson, C. W. and Holloway, C.M. (2004). On the Over-Emphasis of Human ‘Error’ As A Cause of Aviation Accidents: ‘Systemic Failures’ and ‘Human Error’ in US NTSB and Canadian TSB Aviation Reports 1996-2003. From http://www.dcs.gla.ac.uk/~johnson/papers/Cause_comparisons/Error_and_accidents.PDF

Joint Aviation Administration (1994). Joint Aviation Requirements for Large Aeroplanes (JAR–25).

Kaarstad, M., and Ludvigsen, J.T. (2002). Background study for further research in performance recovery. Presented at Enlarged Halden Programme Group Meeting, Storefjell, C2/5/1–16.

Kaber, D.B. (1997). The Effect of Level of Automation and Adaptive Automation on Performance in Dynamic Control Environments (ANRCP-NG-ITWD-97-01). Amarillo, TX: Amarillo National Resource Center for Plutonium.

Kaber, D. B. and Riley, J. (1999). Adaptive automation of a dynamic control task based on secondary task workload measurement. International Journal of Cognitive Ergonomics, 3(3), 169-187.

Kaber, D.B., Prinzel, L.J., Wright, M.C., and Clamann, M.P. (2002). Workload-Matched Adaptive Automation Support of Air Traffic Controller Information Processing Stages (NASA/TP-2002-211932). National Aeronautics and Space Administration. From http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20020080640_2002133430.pdf

Kanse, L. (2004). Recovery uncovered: How people in the chemical process industry recover from failures. PhD dissertation. Eindhoven University of Technology.

Kanse, L. and van der Schaaf, T. (2000). Recovery from failures - understanding the positive role of human operators during incidents. In D. de Waard, C. Weikert, J. Hoonhout and J. Ramaekers (Eds.), Human System Interaction: Education, Research and Application in the 21st Century. Maastricht, Netherlands: Shaker Publishing.

Kennedy, R., Kirwan, B., and Summersgill, R. (2000). Making HRA a more consistent science. In Foresight & Precaution, Eds. Cottam, M., Pape, R.P., Harvey, D.W., and Tait, J. Balkema, Rotterdam.

Kim, M.C., Seong, P.H., and Hollnagel, E. (2005). A probabilistic approach for determining the control mode in CREAM. Reliability Engineering and System Safety, pp. 1-9.



Kirwan, B. (1994). A Guide to Practical Human Reliability Assessment. Taylor & Francis, London, UK.

Kirwan, B. (1997). The development of a nuclear chemical plant human reliability management approach: HRMS and JHEDI. Reliability Engineering and System Safety, Vol 56, pp. 107-133.

Kirwan, B., Gibson, H., Edmunds, J., Cooksley, G., Kennedy, R., and Umbers, I. (1994). Nuclear Action Reliability Assessment (NARA): A Data-Based HRA Tool.

Kirwan, B., Basra, G., and Taylor-Adam, S.E. (1997). CORE-DATA: A Computerised Human Error Database for Human Reliability Support. Proceedings from the Sixth Annual Human Factors Meeting, Orlando, US.

Kontogiannis, T. (1999). User strategies in recovering from system failures in man-machine systems. Safety Science 32(1), pp. 49-68.

Kopardekar, P., and Magyarits, S. (2003). The measurement and prediction of dynamic density. Presented at the FAA-EUROCONTROL ATM 2003 Seminar, Budapest.

Lanzi, P., and Marti, P. (2001). Innovate or preserve: when technology questions cooperative processes. From http://www.dblue.it/pdf/ECCE11_Lanzi_Marti_v3.pdf

Layton, C., Smith, P. J., and McCoy, E. (1994). Design of a cooperative problem-solving system for en-route flight planning: An empirical evaluation. Human Factors, 36, pp. 94-119.

Leveson, N.G. (1995). Safeware: System Safety and Computers. Addison-Wesley Publishing Company, New York.

Littlewood, B., Strigini, L., Wright, D., and Courtois, P.J. (1998). Examination of Bayesian Belief Network for Safety Assessment of Nuclear Computer-Based Systems (ESPRIT DeVa Project 20072). From http://www.csr.city.ac.uk/people/lorenzo.strigini/ls.papers/DeVa_BBN_reports/DeVaTR70_year3.5a/DeVaTR70.pdf

Low, I. and Donohoe, L. (2001). Methods for assessing ATC controllers’ recovery from automation failures. In D. Harris (Ed.), Engineering Psychology and Cognitive Ergonomics Volume 5: Aerospace and Transportation Systems. National Air Traffic Service (NATS), UK.

Majumdar, A., and Ochieng, W.Y. (2002). Estimation of European Airspace Capacity from a Model of Controller Workload. Journal of Navigation, Vol 55(3), pp. 381-403.

Majumdar, A., Ochieng, W.Y., McAuley, G., Lenzi, J.M., and Lepadatu, C. (2004). The Factors Affecting Airspace Capacity in Europe: A Cross-Sectional Time-Series Analysis Using Simulated Controller Workload. Journal of Navigation, Vol 57(3), pp. 385-405.

Massaiu, S., Haugset, H., and Bjorlo, T.J. (2003). Human Reliability Issues in Traffic Control Centres. Norwegian Research Council.

Mauri, G. (2000). Integrating Safety Analysis Techniques, Supporting Identification of Common Cause Failures. PhD thesis, The University of York.

Metzger, U., and Parasuraman, R. (2005). Automation in future air traffic management: Effects of decision aid reliability on controller performance and mental workload. Human Factors, 47(1), 35-49.



Ministry of Land, Infrastructure, and Transport (2006). Statistics. Air Traffic Activity at Cab Facilities: Area Control Center. From http://www.mlit.go.jp/koku/04_hoan/e/statistics/image/00_00.gif

Mohleji, S. C., Lacher, A. R., and Ostwald, P.A. (2003). CNS/ATM System Architecture Concepts and Future Vision of NAS Operations in 2020 Timeframe. Center for Advanced Aviation System Development (CAASD), The MITRE Corporation. From http://www.mitre.org/work/tech_papers/tech_papers_03/mohleji_2020/mohleji_2020.pdf

National Aeronautics and Space Administration (2000). Required Communication Performance (RCP). From http://as.nasa.gov/aatt/wspdfs/Oishi.pdf

National Aeronautics and Space Administration (2002). NASA Safety Manual w/Changes through Change 1 (NPR 8715.3). NASA QS / Safety & Risk Management Division.

National Air Traffic Services (1999). Testing Operational Scenarios for Concepts in ATM (Phase II). WP2: Airspace Sectorisation Optimisation. European Commission.

National Air Traffic Services (2002). Manual of Air Traffic Services Part II. London Area Control Centre, edition 2/02.

National Air Traffic Services (2004). NATS apologises for delays experienced today. From http://www.nats.co.uk/news/news_stories/2004_06_03_2.html

National Transportation Library (1997). Potential Cost Savings Ideas for FAA and Users. From http://ntl.bts.gov/lib/000/500/511/costsav.pdf

National Transportation Safety Board (1973). Aircraft Accident Report (AAR-73-14). From http://amelia.db.erau.edu/reports/ntsb/aar/AAR73-14.pdf

National Transportation Safety Board (1983). Aircraft Accident Report (AAR-83-02). From http://amelia.db.erau.edu/reports/ntsb/aar/AAR83-02.pdf

National Transportation Safety Board (1996). Special Investigation Report: Air Traffic Control Equipment Outages. Washington, D.C.

Nolan, M. S. (1998). Fundamentals of Air Traffic Control. Belmont, USA: Wadsworth Publishing Company.

Nuclear Regulatory Commission (1998). Technical Basis and Implementation Guidelines for a Technique for Human Event Analysis (ATHEANA). NUREG-1624. U.S. Nuclear Regulatory Commission, Washington, DC.

Ochieng, W.Y. (2006). Future Air Traffic Management. Course presentation for Air Traffic Management Module (T23). Imperial College London.

Orasanu, J., and Fischer, P. (1997). Finding decisions in natural environments: the view from the cockpit. In Zsambok, C.E. & Klein, G. (Eds.) Naturalistic decision-making. Mahwah, New Jersey: Lawrence Erlbaum Associates Publishers.

Oren, T., and Ghasem-Aghaee, N. (2003). Personality Representation Processable in Fuzzy Logic for Human Behavior Simulation. Summer Computer Simulation Conference, July 20-24, 2003. Montreal, Canada. From http://www.site.uottawa.ca/~oren/pres/pres-of-2003-01-SCSC-personality.pdf

Parasuraman, R., and Riley, V. (1997). Humans and automation: use, misuse, disuse, abuse. Human Factors Vol 39, 230-253.



Parasuraman, R., Bahri, T., Deaton, J., Morrison, J., and Barnes, M. (1990). Theory and Design of Adaptive Automation in Aviation Systems. Technical Report No. CSL-N90-1, Cognitive Science Laboratory. Catholic University of America, Washington, DC.

Parasuraman, R., Mouloua, M., and Molloy, R. (1996). Effects of adaptive task allocation on monitoring of automated systems. Human Factors, 38, pp. 665-679.

Parasuraman, R., Wickens, C. D., and Sheridan, T. (2000). A model for types and levels of human interaction with automation. IEEE Transactions on Systems, Man, and Cybernetics, 30(3), 286-297.

Park, J., Jung, W., Ha, J., and Shin, Y. (2004). Analysis of operators’ performance under emergencies using a training simulator of the nuclear power plant. Reliability Engineering and System Safety, 83, pp. 179-186.

Perrow, C. (1999). Normal Accidents. Princeton University Press.

Piantek, T.W. (1999). Influence in contracting and purchasing. In Safety Through Design: Best Practices (Eds. Christensen, W.C., Manuele, F.A.). National Safety Council Press.

PPrune Forums (2006). ATC Issues. From http://www.pprune.org/forums/forumdisplay.php?s=ac64e2a0afd13472a93e7df2bba4b826&f=18

Rail Safety and Standards Board (2004). Rail-Specific HRA Tool for Driving Tasks Phase 1 Report. From http://www.rssb.co.uk/pdf/reports/research/T270 Rail-specific HRA tool for driving tasks Phase 1 report.pdf

Rasmussen, J. (1982). Human errors: A taxonomy for describing human malfunction in industrial installations. Journal of Occupational Accidents, 4, 311-335.

Reason, J.T. (1997). Managing the risks of organizational accidents. Aldershot, England: Ashgate Publishing.

Reid, J.W. (1996). Safety by Design. Lecture 4: Cost and acceptability of risk. Hazardous forum: London.

Rigas, G. and Elg, F. (1997). Mental models, confidence, and performance in a complex dynamic decision making environment. Department of Psychology, Uppsala University, Sweden. From http://www.ie.boun.edu.tr/labs/sesdyn/isdc97/TURKIA.doc

RISKS (2000). U.K. ATC System Failure. The RISKS Digest, Vol 20, issue 94. From http://catless.ncl.ac.uk/Risks/20.94.html

Rizzo, A., Ferante, D., and Bagnara, S. (1995). Handling human error. In J.M. Hoc, P.C. Cacciabue, & E. Hollnagel (Eds.), Expertise and Technology: Cognition & Human-Computer Cooperation (pp. 195-212). Hillsdale, NJ: Lawrence Erlbaum.

Saldana, M. A. M., Herrero, S. G., del Campo, M. A. M. and Ritzel, D. O. (2002). Assessing Definitions and Concepts within the Safety Profession. From http://www.aahperd.org/iejhe/2003_first/ritzel.pdf.

Sampaio, J. J. M., and Guerra, A. A. (2004). The day god failed or overtrust in automation: The Portuguese case study. In Proceedings from the 2nd Conference on Human Performance Situation Awareness and Automation (HPSAA 2). Daytona Beach, FL.



Scerbo, M.W. (2005). Adaptive Automation. Department of Psychology, Old Dominion University. From http://www.cs.colorado.edu/~mozer/courses/6622/papers/aachpt05-12-15.htm

Sellen, A. J. (1994). Detection of everyday errors. Applied psychology: An International Review 43(4), pp. 475-498.

Shappell, S.A. (2000). The Human Factors Analysis and Classification System-HFACS (DOT/FAA/AM-00/7). Federal Aviation Administration. US Department of Transportation. From http://www.nifc.gov/safety_study/accident_invest/humanfactors_class&anly.pdf

Sheridan, T.B. (1980). Computer control and human alienation. Technology Review, Vol 10, pp. 61-73.

Shier, R. (2004). The Mann-Whitney U Test. Mathematics Learning Support Centre. From http://mlsc.lboro.ac.uk/documents/Mannwhitney.pdf

Shorrock, S. (2002). Error Classification for Safety Management: Finding the Right Approach. In C.W. Johnson (Ed.), Investigation and Reporting of Incidents and Accidents IRIA 2002 (pp. 57-67). From http://www.dcs.gla.ac.uk/~johnson/iria2002/IRIA_2002.pdf

Shorrock, S. T., and Kirwan, B. (2002). Development and application of a human error identification tool for air traffic control. Applied Ergonomics, Vol 33, pp. 319–336.

Smith, S.P., Harrison, M.D. and Schupp, B.A. (2004). How explicit are the barriers to failure in safety arguments? Computer Safety, Reliability, and Security (SAFECOMP'04). In M. Heisel, P. Liggesmeyer and S. Wittmann (Eds), Lecture Notes in Computer Science Vol 3219, pp. 325-337, Springer.

Sorensen, J.N. (2002). Safety culture: a survey of the state-of-the-art. Reliability Engineering and System Safety, Vol 76, pp. 189-204.

Straeter, O. (2000). Evaluation of human reliability on the basis of operational experience. Dissertation at Munich Technical University.

Straeter, O. (2001). The quantification process for human interventions. In: Kafka, P. (ed.) PSA RID – Probabilistic Safety Assessment in Risk Informed Decision making. EURO-Course. 4.- 9.3.2001. GRS. Germany.

Straeter, O. (2005). Cognition and Safety: An Integrated Approach to Systems Design and Performance Assessment. Ashgate: Aldershot.

Subotic, B., Ochieng, W.Y., and Majumdar, A. (2005). Equipment Failures in Air Traffic Control: Finding an appropriate safety target. The Aeronautical Journal of the Royal Aeronautical Society, Vol 109(1096), pp. 277-284.

Subotic, B., Ochieng, W.Y., and Straeter, O. (2006a). Recovery from equipment failures in ATC: An overview of contextual factors. Reliability Engineering and System Safety Journal Vol 92 (7), pp. 858-870.

Subotic, B., Ochieng, W. and Straeter, O. (2006b). Recovery from Equipment Failures in Air Traffic Control: A Probabilistic Assessment of Context. Probabilistic Safety Assessment (PSAM 08) Conference, May 14-19, 2006, New Orleans, US.

Swain, A. D., and Guttman, H. E. (1983). Handbook of human reliability analysis with emphasis on nuclear power plant applications (NUREG/CR-1278). Washington D.C.

Theis, I. and Sträter, O. (2001). By-Wire Systems in Automotive Industry. Reliability Analysis of the Driver-Vehicle-Interface Proceedings. ESREL 2001, Turin.



THEMES (2001). Thematic Network for Safety Assessment of Waterborne Transport. Deliverable No. D5.1. Report on Safety and Environmental Assessment Method. From http://projects.dnv.com/themes/Deliverables/D5.1Final.pdf

Theureau, J., Jeffroy, F., and Vermersch, P. (2000). Controlling a nuclear reactor in accidental situations with symptom-based computerized procedures: a semiological & phenomenological analysis. Proceedings from CSEPC 2000. Taejon, Korea, 22-25 November.

UK Civil Aviation Authority (2000). Aviation safety review 1990-1999 (CAP 701). Civil Aviation Authority, London.

UK Civil Aviation Authority (2003). United Kingdom Manual of Personnel Licensing - Air Traffic Controllers (CAP 744). Civil Aviation Authority. London.

UK Civil Aviation Authority (2004). Fact Sheet - SSR Mode S, Edition 1.2. From http://www.caa.co.uk/docs/810/DAP_SSM_Mode_S_SSR_Factsheet.pdf

UK Civil Aviation Authority (2005). Mandatory Occurrence Reporting Scheme. CAP 382. Civil Aviation Authority, London. From http://www.caa.co.uk/docs/33/CAP382.PDF

UK Civil Aviation Authority (2006). Manual of Air Traffic Services - Part 1 (CAP 493). Civil Aviation Authority, London. From http://www.caa.co.uk/docs/33/CAP493Part1.pdf

United Nations (2006). UN in Brief. From http://www.un.org/Overview/brief1.html#footnote

van der Schaaf, T. W. (1992). Near miss reporting in the chemical process industry. PhD thesis. Eindhoven University of Technology.

van der Schaaf, T.W. (1995). Human recovery of errors in man-machine systems. Proceedings of the Sixth IFAC/IFIP/IFORS/IEA Symposium on the Analysis, Design and Evaluation of Man–Machine Systems. Cambridge, MA.

van Es, G.W.H. (2003). Review of Air Traffic Management-related accidents worldwide: 1980-2001. National Aerospace Laboratory (NLR).

Ward, M., Grupen, L., Regehr, G. (2002). Measuring Self-assessment: Current State of the Art. Advances in Health Sciences Education, 7, pp. 63–80.

Weisberg, H.F., Krosnick, J.A., and Bowen, B.D. (1996). An Introduction to Survey Research, Polling, and Data Analysis. SAGE Publications: London.

Wickens, C.D. (1992). Engineering psychology and human performance, 2nd Ed. New York: Harper Collins.

Wickens, C.D. (2001). Attention to Safety and the Psychology of Surprise. From http://www.aviation.uiuc.edu/UnitsHFD/conference/Osukeynote01.pdf

Wickens, C.D., Lee, J.D., Liu, Y., and Gordon Becker, S.E. (2004). An Introduction to Human Factors Engineering. New Jersey: Pearson Prentice Hall.

Wickens, C.D., Mavor, A., and McGee, J.P. (Eds.) (1997). Flight to the Future: Human Factors in Air Traffic Control. Washington, DC: National Academy Press.

Wickens, C.D., Mavor, A. S., Parasuraman, R., and McGee, J.P. (1998). The Future of Air Traffic Control: Human Operators and Automation. National Academy Press: Washington, DC.

Wiener, E.L. and Curry, R.E. (1980). Flight deck automation: promises and problems. Ergonomics, Vol 23, pp. 995-1011.



Williams, J.C. (1986). HEART – A Proposed Method for Assessing and Reducing Human Error. In 9th Advances in Reliability Technology Symposium. University of Bradford, 1986.

Wood, A. (1996). Software Reliability Growth Models. From http://www.hpl.hp.com/techreports/tandem/TR-96.1.pdf

Zapf, D., and Reason, J.T. (1994). Introduction: Human Error and Error Handling. Applied Psychology: An International Review, Vol 43(4), pp. 427-432.

Appendices


Appendix I The cost of delays induced by ATC equipment failures

Appendix II Interviews with ATM staff

Appendix III Checklist for the Equipment Failure Scenarios in a specific European ATC Centre - An Aide-Memoire framework

Appendix IV The questionnaire design

Appendix V Example of one questionnaire response

Appendix VI Results extracted from the question 5 of the questionnaire survey

Appendix VII Overview of contextual factors

Appendix VIII Probabilities for 20 Recovery Influencing Factors (RIFs)

Appendix IX Questions for the ATM Specialist

Appendix X Overview of RIFs, their corresponding levels, and designated probabilities

Appendix XI Validation of the RIFs interaction matrix

Appendix XII Distribution of 20 Recovery Influencing Factors (RIFs)

Appendix XIII Experimental material

Appendix XIV Overview of RIFs, their corresponding levels, and probabilities determined in the experimental investigation



Appendix I The cost of delays induced by ATC equipment failures

The impact of an equipment failure on ATM can be analysed from several different

perspectives. From a financial perspective, it is necessary to consider the costs

identified in ATC and the cost of delays in a wider region. A small exercise has been

conducted on the cost of delays induced by ATC equipment failures in the European

Civil Aviation Conference (ECAC) and US airspace.

EUROCONTROL’s Central Flow Management Unit (CFMU) data for the period from 1999 to 2003 (Table 1) split ATC equipment failure induced delays into en route delays and airport delays. Given that the cost of one minute of delay in Europe in the year 2002 is estimated to be EUR72 (EUROCONTROL, 2004a), the last column of Table 1 presents the total costs incurred by airlines as a result of airborne and ground delays, i.e. the total delay in minutes multiplied by EUR72. It is important to highlight that the estimate for the cost of one minute of delay (EUR72) is based on primary delay costs, reactionary delay costs (e.g. the ‘knock-on’ effect on other aircraft), as well as fuel, maintenance, ground handling of aircraft and passengers, passenger costs of delay to the airline, and future loss of market share due to lack of punctuality (EUROCONTROL, 2004a). As a result, the calculated annual cost of delays caused by ATC equipment failures accounts for all relevant costs and thus demonstrates the high cost of technical failures.

Table 1 ATC equipment as a cause of airport and en route delays (personal correspondence; see footnote 1)

Year | En route delay (min) | Airport delay (min) | Total delay (min) | Annual cost for the airlines (million EUR), based on the year 2002
1999 | 609265 | 461290 | 1070555 | 77.08
2000 | 598660 | 265055 | 863715 | 62.19
2001 | 614534 | 406760 | 1021294 | 73.53
2002 | 425627 | 138045 | 563672 | 40.58
2003 | 149476 | 147528 | 297004 | 21.38
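The arithmetic behind the last column of Table 1 can be reproduced with a short script. The following is a minimal sketch, assuming that the annual cost is simply the total delay in minutes multiplied by the EUR72 per-minute estimate; the function and variable names are illustrative and do not come from the thesis or from EUROCONTROL.

```python
# Illustrative check of Table 1: annual cost = total delay (min) x cost per minute.
COST_PER_MINUTE_EUR = 72  # per-minute estimate for 2002 quoted in the text

# (en route delay, airport delay) in minutes, copied from Table 1.
DELAYS_MIN = {
    1999: (609265, 461290),
    2000: (598660, 265055),
    2001: (614534, 406760),
    2002: (425627, 138045),
    2003: (149476, 147528),
}


def annual_cost_million_eur(en_route_min, airport_min,
                            cost_per_min=COST_PER_MINUTE_EUR):
    """Return (total delay in minutes, cost in million EUR to two decimals)."""
    total = en_route_min + airport_min
    return total, round(total * cost_per_min / 1e6, 2)


for year, (en_route, airport) in DELAYS_MIN.items():
    total, cost = annual_cost_million_eur(en_route, airport)
    print(year, total, cost)
```

Under this assumption the script reproduces the totals and costs listed in Table 1 for all five years.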

There are a number of reasons for the differences in the delay reported by the CFMU (Table 1) for a given period. Some global factors explaining the delay reductions in the decade beginning in 2000 are the general reduction of air traffic (as a result of the crisis in the aviation industry after September 11th 2001), the presence of severe factors (e.g. closure of Yugoslav airspace in 1999), the introduction of new route structures in 1999, the influence of European ATM network programs (e.g. Reduced Vertical Separation Minima (RVSM), improved capacity management), and staffing issues that reached a record high in 2002 (EUROCONTROL, 2003b).

1 Personal correspondence with EUROCONTROL CFMU.

Similar calculations have been carried out for the impact of ATC equipment failures on the US National Aviation System (NAS) as a whole. The US NAS consists of aircraft, pilots, facilities, controllers, airports, and maintenance personnel, together with computers, communications equipment, satellite navigation aids, and radars. The direct aircraft operating cost per minute of delay is taken from the Air Transport Association (ATA) estimate for the year 2005, which is $62.33 (Air Transport Association, 2006). This cost comprises fuel burn, extra crew time, maintenance, aircraft ownership costs, and additional costs. These additional costs account for the costs of extra gates and manpower on the ground and the costs imposed on airline customers (passengers and cargo shippers) in the form of lost productivity, wages, and customer satisfaction. The FAA estimates the average cost of delay to air travelers to be $30.26 per hour or $0.50 per minute (Air Transport Association, 2006). As a result, the average costs of delays induced by ATC equipment failures for the years 2004 and 2005 are given in Table 2.

Table 2 ATC equipment as a cause of the US National Aviation System delays. From Bureau of Transportation Statistics (2004), summaries available only for the whole of 2004 and 2005

Year | ATC equipment (min) | Average cost (millions $)
2004 | 402644 | 25.10
2005 | 274126 | 17.09
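The US figures can be checked in the same way. The sketch below assumes that the average cost in Table 2 is the delay in minutes multiplied by the ATA estimate of $62.33 per minute, and it also converts the FAA hourly passenger figure to a per-minute value. All names are illustrative.

```python
# Illustrative check of Table 2 and of the passenger cost conversion.
ATA_COST_PER_MINUTE_USD = 62.33          # direct aircraft operating cost per minute of delay
FAA_PASSENGER_COST_PER_HOUR_USD = 30.26  # average cost of delay to air travelers

# Delay minutes attributed to ATC equipment, copied from Table 2.
EQUIPMENT_DELAY_MIN = {2004: 402644, 2005: 274126}


def delay_cost_million_usd(delay_min, cost_per_min=ATA_COST_PER_MINUTE_USD):
    """Return the cost of `delay_min` minutes of delay in million USD (two decimals)."""
    return round(delay_min * cost_per_min / 1e6, 2)


# $30.26 per hour expressed per minute, rounded to cents.
passenger_cost_per_minute = round(FAA_PASSENGER_COST_PER_HOUR_USD / 60, 2)

for year, minutes in EQUIPMENT_DELAY_MIN.items():
    print(year, minutes, delay_cost_million_usd(minutes))
print("passenger cost per minute:", passenger_cost_per_minute)
```

With these inputs the computed costs agree with Table 2, which suggests that the passenger figure of $0.50 per minute is reported for context and is not added on top of the $62.33 estimate.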

In general, these high-level analyses illustrate that equipment failures can significantly affect operational, safety, and financial aspects of both ATC and ATM systems. The methods employed for Europe and the US to calculate the cost of delay are largely similar. The only difference is the financial value assigned to each minute of delay in Europe and the US. In addition, the ‘true’ cost of equipment failure induced delay should also incorporate technical repair, unscheduled maintenance, training, and additional staffing. However, it is assumed that these costs represent only a small fraction of the cost of delay. Therefore, it can be concluded that these estimates are a reasonable representation of the total cost induced by ATC equipment failure in both the European and the US aviation markets.


Appendix II Interviews with ATM staff

Interviews with relevant Air Traffic Management (ATM) staff, as a method of data

collection, have been conducted to support the research presented in this thesis and to

augment available theoretical findings. They aimed to extract operational experience of

ATM specialists and experienced system control and monitoring engineers. The focus

of these interviews has been on four research areas. These are:

• classification of ambiguous operational failure reports;
• characteristics of air traffic controller training;
• characteristics of equipment failures in Air Traffic Control (ATC); and
• contextual factors relevant to controller recovery from equipment failures in ATC.

Interviews with ATM specialists focused on the air traffic controller training (ab initio,

recurrent, and emergency training) and contextual factors relevant to controller

recovery. Interviews with system control and monitoring engineers revealed their

experiences related to the characteristics of ATC equipment failures.

The sample of ATM staff interviewed is as follows:

• system control and monitoring engineers from four countries:
  o National Air Traffic Services (NATS), Corporate and Technical Centre (CTC) and Swanwick Centre, UK;
  o EUROCONTROL Maastricht Upper Area Control Centre (MUAC), Netherlands;
  o Irish Aviation Authority (IAA);
  o Airports Authority of India (AAI);
• ATM specialists from two countries:
  o EUROCONTROL Institute of Air Navigation Services (IANS), Luxembourg;
  o Irish Aviation Authority (IAA).

Findings related to each research area are presented below.


Table A-1 Findings related to the clarification of ambiguous operational data

Location | Number of participants interviewed | Research question | Finding | Agreement between study participants
UK NATS (CTC) | one experienced engineer | Ambiguous operational failure reports | Proper classification of all operational failure reports | Yes, clarified all ambiguities
EUROCONTROL MUAC | two experienced engineers | as above | as above | as above

Table A-2 Findings related to the air traffic controllers training

Location | Number of participants interviewed | Research question | Findings | Agreement between study participants
EUROCONTROL IANS | one ATM specialist | Usefulness of announcing the training for unusual/emergency situations | Although controllers may anticipate an unusual occurrence within their emergency training, this does not facilitate better performance as long as they do not know the nature of that unusual occurrence | Yes, both agreed
IAA | one ATM specialist | as above | as above | as above

Table A-3 Findings related to the characteristics of equipment failures in ATC

Location | Number of participants interviewed | Research question | Finding | Agreement between study participants
… | … engineer | Existence of latent failures | Latent failures tend to go unnoticed until some other event or failure reveals their existence. | Yes, experienced latent software failures
EUROCONTROL MUAC | one experienced engineer | as above | as above | as above
IAA | one experienced engineer | as above | as above | as above
… | … engineer | Complexity of failure type | Majority of ATC equipment failures affect single system. | Yes
EUROCONTROL (MUAC) | two experienced engineers | as above | as above | as above
IAA | one experienced engineer | as above | as above | as above
… | … engineer | Time course of failure development | Majority of failures tend to manifest themselves suddenly | Yes
EUROCONTROL (MUAC) | two experienced engineers | as above | as above | as above
IAA | one experienced engineer | as above | as above | as above


Table A-4 Findings related to the contextual factors relevant to controller recovery from equipment failures in ATC

Location | Number of participants interviewed | Research question | Finding | Agreement between study participants
IAA | two ATM specialists | Contextual factors relevant to controller recovery from equipment failures in ATC | Validation of the candidate contextual factors | Agreed on selected contextual factors and aided the definition of each factor
IAA | three ATM specialists | Interactions between contextual factors | Validation of interactions between contextual factors identified using operational experience and the past research | Their feedback was similar. Identified inconsistencies were further clarified during the interview and were the result of the misperception of some factors. All inconsistencies were clarified.


Appendix III Checklist for the Equipment Failure Scenarios in ATC Centre - An Aide-Memoire framework

This section provides a framework for the design of Aide-Memoire or checklist type procedures for recovery from equipment failures in a particular ATC Centre. The proposed framework is adapted to an ATC Centre that participated in the experimental investigation segment of the research presented in this thesis. This Aide-Memoire provides a potential framework, which needs to be further discussed and developed in accordance with the in-house expertise of the system control and monitoring staff and ATM specialists of the respective ATC Centre. However, the concept and the design solution presented here are transferable across ATC Centres.

Contents

Once all equipment failures to be included in the Aide-memoire have been defined,

they could be categorised into four distinct groups based upon their impact on ATC

operations (as discussed in Chapter 4). These four categories are as follows:

• Major impact to operations room (all sectors/all workstations) – severe flow restrictions possible (potential colour coding in Aide-Memoire: RED). Relevant failures are:
  o ONL LAN failure
  o Failure of the Surveillance Network
  o Failure of COMPAD
  o Loss of Flight Server
  o Loss of Track Server
  o Loss of SSR and PSR
  o Loss of FDPS
  o Loss of MRP
• Moderate impact to operations room – impact to one or several workstations in different suites, possible need to combine/move positions immediately, and possible flow restrictions (potential colour coding in Aide-Memoire: YELLOW). Relevant failures are:
  o Reduced radar data mode
  o Reduced alert mode
  o Reduced communication mode
  o Loss of ARTAS
  o Loss of VCS panel
  o Loss of a single CWP
  o Loss of entire sector suite
  o Loss of SRP
  o Loss of adjacent sector
• Minimal impact – not immediately critical but may have greater operational impact over time. Relevant failures are:
  o Radar Data Function failure
  o Loss of single frequency
  o Overload of SRP
  o Overload of MRP
  o Loss of external feeds to AIS
  o Loss of STCA
  o Loss of APW
  o Loss of MSAW
  o Loss of OLDI
  o Loss of paper strip printer

Note that the categorisation above lists some but not all possible failures. Those

marked in italics are designed in the Aide-Memoire format and are presented below.

Further input from system control and monitoring staff and ATM specialists may yield

more accurate and precise types of failures and recovery steps to be taken.

Design

At the top of each procedure, it would be useful to show the pictorial Human Machine Interface (HMI) warning as it appears to the controller, if applicable (e.g. the highlighted labels on the General Information Window). This would be followed by two types of information. Firstly, the required recovery steps, i.e. those that a controller must perform to recover effectively and ensure a safe air traffic control service. Secondly, the key effects of the equipment failure on the ATC system (i.e. the ATC system feedback). The rationale for this design solution is that the top part of the checklist should be reserved for the items that controllers should be aware of first, i.e. recovery steps.

In addition, it is necessary to define procedures for different personnel working in the

operational environment, namely controllers (i.e. different roles for executive, planner,

and assistant controller), supervisors, and managers to assure a seamless recovery

process. If, for example, radar services fail on all workstations, personnel should have

a readily available guide to help them recover from the failure. These guidelines may

vary according to the type of user, because different roles may require different

information on equipment failures and recovery procedures.


Potential colour coding in Aide-Memoire (minimal impact category): GREEN


Note that the colour-coded categorisation could be used in a slightly different manner as well. If this Aide-Memoire becomes a part of the generic procedures for handling emergency/unusual situations, then the use of colour should be restricted to categories such as ‘Aircraft Emergencies’, ‘Equipment Failures’, and ‘Fire and Building Evacuation’.

The Aide-Memoire, as a hard, laminated copy flip chart, should be readily available on

each Controller Working Position (CWP). A more detailed version, providing local or

ATC Centre specific data, should be at the supervisor’s position. For simplicity and

efficiency, it is better to present each relevant failure on a single page highlighting the

two main areas: what recovery steps to perform and what feedback to expect from the

ATC system. This approach assures the most efficient usage of the tool.
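The single-page layout described above (HMI warning first, then recovery steps, then expected system feedback) can be pictured as a simple record. The following is a minimal sketch only; the class, field, and function names are hypothetical, and the assignment of the colour YELLOW assumes that Reduced Radar Data Mode belongs to the moderate impact category. The content of the example page is taken from the Reduced Radar Data Mode procedure in this appendix.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AideMemoirePage:
    """One Aide-Memoire page: a single failure with its steps and feedback."""
    failure: str
    colour: str                      # impact category colour, e.g. RED, YELLOW, GREEN
    hmi_warning: str = ""            # pictorial/label warning, if applicable
    atco_actions: List[str] = field(default_factory=list)
    expect: List[str] = field(default_factory=list)


def render(page: AideMemoirePage) -> str:
    """Lay out a page with recovery steps above the expected system feedback."""
    lines = [f"{page.failure} [{page.colour}]"]
    if page.hmi_warning:
        lines.append(page.hmi_warning)
    lines.append("ATCO actions:")
    lines += [f"- {action}" for action in page.atco_actions]
    lines.append("Expect:")
    lines += [f"- {item}" for item in page.expect]
    return "\n".join(lines)


REDUCED_RADAR = AideMemoirePage(
    failure="Reduced Radar Data Mode",
    colour="YELLOW",  # assumption: moderate impact category
    hmi_warning='GIW will show "MRTS"',
    atco_actions=["Inform Coordinator", "Report failure", "Operate as normal"],
    expect=[
        "All functions are available",
        "The switch to RFS (MRTS) from ARTAS is automatic",
    ],
)

print(render(REDUCED_RADAR))
```

Keeping the recovery steps and the expected feedback as separate lists mirrors the two main areas of each page, and would allow a more detailed supervisor version to reuse the same record with additional local data.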

The final version of the Aide-Memoire should not be considered an exhaustive list but rather a living document. In other words, it will be necessary to update this tool on an annual basis to reflect the local expertise and to compile all changes (i.e. changes in the ATC system, both software and hardware).


ONL LAN Failure

ATCO actions:

− Inform Coordinator
− Inform all traffic
− Check spare ODS
− Maintain timely & accurate strip marking
− Restrict traffic
− Utilise holding patterns
− Use only verbal coordination channels
− Reaffirm traffic identification using the code on the FPS
− Identify any new tracks using the “Confirm Squawk?” method
− Seek SAS assistance and print screen if possible
− Ground all sport/non-commercial traffic ASAP
− Utilise strategic ATC techniques when possible
− Conduct regular checks of aircraft identification
− Monitor Mode C closely
− Be aware of the absence of Safety Nets and Monitoring Aids
− Cross check that exit conditions are achieved
− Expedite reduction in traffic load


Expect:

The radar data is distributed via the RFS LAN

The following functions are NOT AVAILABLE:

− Safety Nets and Monitoring Aids (existing alarms maintained)
− Flight Plan function (no coupling, no RAM & CLAM)
− Radar Data function replaced by Radar Fallback function
− Flight plan commands (i.e. mod)
− Flight plan lists frozen with data at time of failure
− Reception Queues
− Message transmission
− Coordination messaging
− Mail box management
− Resectorisation
− SSR code management
− AIS (only data available at the time of failure)
− All correlation will be lost


Failure of the Surveillance Network

ATCO actions:

− Inform Coordinator
− Inform all traffic
− Employ procedural control techniques (if necessary utilise emergency vertical separation of 500 feet)
− Utilise holding patterns
− Deny departures
− Maintain timely & accurate strip marking
− Instruct aircraft to maintain VMC, if in VMC
− Reduce traffic load ASAP
− Seek assistance
− Relocate to contingency site if required

Expect:

All ODS frozen or blanked throughout the Centre


Failure of COMPAD

ATCO actions:

− Inform Coordinator
− Transmit on second sector COMPAD
− Access RBS and inform traffic of failure
− Reset COMPAD
− Seek assistance and relocate to spare CWP
− Inform traffic of restoration of normal service when service is restored

Expect:

Complete or Partial failure

Inability to transmit on RTF

Inability to access alternate RTF

Inability to use intercoms

Inability to access telephone network


Reduced Radar Data Mode

GIW will show “MRTS”

ATCO actions:

− Inform Coordinator
− Report failure
− Operate as normal

Expect:

All functions are available

The switch to RFS (MRTS) from ARTAS is automatic

Any position in by-pass before ARTAS failure will remain in by-pass


Reduced Alert Mode

GIW will show “SNMAP”

ATCO actions:

− Inform Coordinator
− Be aware of restricted, danger and prohibited airspace inc. TSA’s
− Check MSA’s at regional airports
− Double and cross check Oceanic Entry COP’s and levels
− Maintain timely & accurate strip marking
− Utilise strategic traffic plans
− Ensure tactical ATCO action is accurate
− Employ TRM best practice
− Continuously scan Mode C
− Seek SAS assistance if necessary

Expect:

Any alert displayed prior to the reduced alert mode will remain displayed regardless of whether or not the alert is still valid.


− Safety Net Function (STCA)
− ATC Tools (MSAW and APW)
− Monitoring Aids (RAM and CLAM)
− Coupling
− No APR sent to Flight Data function (no profile updates)


Reduced Flight Plan Mode

GIW will show “FDP”

ATCO actions:

− Inform Coordinator
− Check availability of FDP function on spare ODS
− Inform traffic of failure
− Maintain timely & accurate strip marking
− Use verbal coordination channels inter sector/centre
− Identify all new tracks using the “Confirm Squawk” technique
− Maintain identification by regular checks
− Restrict traffic flow where necessary
− Utilise holding patterns
− Be aware of unreliable Safety Nets and Monitoring Aids
− Seek SAS assistance where necessary

Expect:


− Flight Plan tracks
− Tracks already displayed will remain displayed
− Flight Plan commands (i.e. mod, terminate)
− Message queues
− Message transmission
− Coordination messages
− Mailbox management
− Resectorisation
− Limited Safety Net and Monitoring Aids due to no update of the flight plans


Reduced Communication Mode

GIW will show “FDX”

ATCO actions:

− Inform Coordinator
− Use only verbal inter-centre coordination channels
− Inform all traffic on RTF
− Seek FDA assistance for AFTN or AIS information
− Maintain timely & accurate strip marking
− Seek SAS assistance where necessary

Expect:


− Inter centre communications
− AFTN
− Coordination messages (except inter sector)
− Flight plans are not updated by external messaging
− AIS


Radar Data Function failure

ATCO actions:

− Inform Coordinator
− Select radar by-pass services

Expect:

No radar data function (neither ARTAS nor MRTS nor RFS)


Appendix IV The questionnaire design

Air Traffic Controller Questionnaire

Dear Sir/Madam,

This questionnaire has been created to obtain information on equipment failures and recovery in Air Traffic Control (ATC) System(s) from various standpoints. The information you provide will be used in a research project jointly supported by EUROCONTROL Experimental Centre and Imperial College London.

We would greatly appreciate your completing the attached questionnaire. It will only take a few minutes of your time to answer the questions, which will contribute to our joint effort to introduce more real experience into ATC safety analysis. The data collection is intended to support recovery strategies of future ATM and to analyse the current status of this issue. The information that you provide will be used as an additional data source for the PhD dissertation being developed in this area.

The questionnaire is created in Microsoft® Word 2000. It is our intention to enable you to fill it out electronically and send it directly to the following e-mail address ([email protected]). However, if it is more convenient, you can use the fax number provided below.

Generally, there are two formats of questions, which require different ways of answering. For some questions you will have to choose the most appropriate answer by highlighting or marking it (e.g. yes/no answers), while for the others you will have to type in your full answer. Please fill out your questionnaire and try to answer the questions in as much detail as possible.

Your answers will be strictly confidential and de-identified; thus your personal details will not appear in any document connected to this research.

Thank you in advance for your time and effort.

Sincerely, Branka Subotic

Research PhD student Imperial College London Centre for Transport Studies London SW7 2AZ

Phone +44 (0)2075946 022 Fax +44 (0) 2075946 102

[email protected]


Air Traffic Controller Questionnaire

1. Total number of years active as a controller ____________

2. Please list the types of facilities that you have worked in, beginning with the most recent.

ATC Facility Name (beginning with the most recent) | Location | Country | Number of years worked in particular Unit | Type (Civilian/Military) | Position/Rating (ACC/RDR, ACC/PROC, APP/RDR, APP/PROC, TWR or ARTCC, TRACON, ATCT (USA))

3. Have you ever experienced ATC equipment failure during your work? Mark the corresponding letter. (If ‘No’ go to question 10) Y N

4. What is the average number of ATC equipment failures during one year that you experience? _________________________


5. Please fill in any previous experience with equipment failures which seriously impacted your work:

Type of equipment failure | System affected? (See Note below) | Frequency of the failure per year (in your own experience)? | Did you detect it and how? If not, who detected it? | Duration of the failure min, h, days (If you can recall)? | Was the context* of the failure an important factor? If yes, has it positive or negative impact? | Recovery/contingency procedure existed or not? | Recovery/contingency training existed or not? | Who initiated the recovery? How was the recovery initiated? | Any additional comment

* Context is defined as any aspect of the operating context that influenced the failure or recovery aspect (e.g. workload, HMI, personal factors, team factors).

Note: The typical CWP (controller working position) contains one or more of the following systems (systems will vary from one center and country to another):

• Radar (SSR, PSR, Mode S, radar data processing (RDP), multi-radar processing (MRP), single radar processing (SRP))
• Ancillary screens (meteorological information, strip bay, traffic flow information, etc.)
  o Flight Plan Processing (FPP)
  o Flight Progress Strips (FPS)
• Pointing devices (mouse & trackball)
• Secondary input devices (keyboard or touch input device (TID))
• Communication panel
• R/T, telephone, headset, intercom
• Strip printer
• Ground based Safety Nets (SNET): STCA, MSAW, APW, or any other SNET available
• Other (e.g. power supply)

6. How much do you generally rely upon the written procedures in case of equipment failure and how much on situation-specific problem solving (i.e. improvisation)? Fill in the corresponding number for Procedures, Problem solving, AND Other.

1 (very much) 2 3 (moderately) 4 5 (not at all)

Written procedures


Other (e.g. past experience)

7. Is there any organized exchange of the past experience in solving the equipment failures with your fellow colleagues?

Y N

8. If yes, is it supported by your management as a good work practice? Y N

9. According to your experience, what are the three most unreliable ATC systems/subsystems? Please use the device listing from the Note above to state those systems starting with the most unreliable one:

(Note: Reliability is defined in this questionnaire as the probability that a piece of equipment or component will perform its intended function without failure over the given time period and under specific or assumed conditions)


The following questions should be answered in relation to your current job, position, and level of experience (the first one cited in question 2).

Procedures

10. Are recovery/contingency procedures available? Mark the corresponding letter. Y N

11. Which types of equipment failures (outages) are covered by procedures in your Center?

12. Are recovery/contingency procedures up-to-date? Y N

13. Are recovery/contingency procedures comprehensive? Y N

14. Are recovery/contingency procedures complete? Y N

15. If not, which procedure(s) would you add?

16. Are recovery/contingency procedures understandable? Y N

17. Are recovery/contingency procedures easily accessible? Y N

18. Are recovery/contingency procedures realistic/feasible? Y N

19. Are recovery/contingency procedures compatible with other procedures? Y N


20. Describe the situation when you had a problem applying the recovery/contingency procedure and why?

Training

21. Is training provided in recovery from equipment failures? Y N

22. Is there separate refreshment training every year? Y N

23. If provided, how many times per year?

24. Is it enough? Y N

25. Does the training cover all important equipment failures? Y N

26. If not, what should be added?

27. Are training methods suitable (realistic, varied, etc)? Y N

28. Is recovery/contingency training compatible with and linked to other training? Y N


Conclusion

29. Please write down any other comments or suggestions based on your past experience or professional opinion that you might have on the issue of equipment failures, recovery/contingency procedures, or training.

Thank you for taking the time to answer these questions. Your time and participation are greatly appreciated.

--End--


Appendix V Example of one questionnaire response


Appendix VI Results extracted from question 5 of the questionnaire survey

Question 5 aimed to give controllers an opportunity to discuss their past experience with equipment failures that seriously affected their work. In order to provide a structured description of each example and extract all relevant information, question 5 was presented in the form of a table. The rows dealt with different failure types while the columns dealt with various failure characteristics. These failure characteristics were as follows:

1. Type of equipment failure and system affected (assessed in section 6.7.3.3 of Chapter 6);
2. Frequency of failure per year;
3. Individual who detected the failure;
4. Duration of the equipment failure;
5. Importance of the recovery context;
6. Existence of recovery procedure for a particular failure (assessed in Table 6-3, Chapter 6);
7. Existence of training for recovery for a particular failure;
8. Individual who initiated the recovery and method applied; and
9. Concluding remarks.

1. Frequency of failure per year

It was not possible to extract the frequency of failure experienced by controllers in 27.20 percent of cases. This was partially due to missing responses but mostly due to vague and unclear responses (e.g. very often, rare). The available and pre-processed data show that the frequency of failures per year is on average more than 14, ranging from less than once per year to as many as 730 annually (or twice per day). The great dispersion of the data confirms that respondents interpreted equipment failures differently (as discussed in section 6.7.3.1 of Chapter 6).

2. Individual who detected the failure

The failures were detected most frequently by controllers (in 79.4 percent of examples) and with the assistance of the system-generated failure alert (in 7.1 percent of examples). Other cases include failure detection by watch supervisors, engineers, pilots, or controllers from other ATC Centres (in the case of a failure affecting national or regional airspace, such as failure of satellite communication, the flight data processing system, or radar). These findings are expected, as NATS (2002) reports that most failures do not affect the controllers because they are prevented or recovered by the system control and monitoring unit. Moreover, the results obtained from this questionnaire survey emphasise that the prompt detection of any ATC system deficiency depends mostly on the controller, as a direct result of the controller’s situational awareness. Furthermore, the results show that failure detection may be aided by system-generated failure alerts. This is an example of the synergy that exists between technical and controller recovery, achieved through the technical built-in defences for transmitting information on failure (discussed in Chapter 4, section 4.3.2). These technical systems will demonstrate more potential in the future, highly integrated ATC environment.

3. Duration of the equipment failure

Similar to the frequency variable, it was not possible to extract the duration of failures in 27.20 percent of examples. This was expected due to the difficulties with recalling the duration of past failures. Additional problems were encountered with vague qualitative responses (e.g. several days, a couple of hours, a few minutes). The available and pre-processed data show that the average duration of the reported failures was close to one day, ranging from five minutes to one month. The large dispersion indicates different durations for different types of failures.

The same categorisation of the duration variable is applied as previously with the operational failure reports (see Chapter 4, section 4.4.6). More precisely, the categorisation distinguishes failures lasting up to 15 minutes, between 15 minutes and one hour, between one hour and one day, and more than one day. It is interesting to note that the distributions of duration from the operational failure reports and from the past experience captured in this survey are similar (Figure 1). The difference is observed in the third category (duration from one hour to one day). It seems that in the operational environment, equipment failures of this duration tend to occur more frequently compared to the experience of controllers worldwide.
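The four duration categories can be expressed as a small binning routine. This is a minimal sketch rather than the procedure used in the thesis: the treatment of values falling exactly on a boundary (assigned to the lower category), the function names, and the sample durations are all assumptions made for illustration.

```python
from collections import Counter

# Upper bounds in hours for the four duration categories described above.
CATEGORIES = [
    (0.25, "up to 15 minutes"),
    (1.0, "15 minutes to one hour"),
    (24.0, "one hour to one day"),
    (float("inf"), "more than one day"),
]


def categorise(duration_hours):
    """Return the duration category for a failure lasting `duration_hours`."""
    if duration_hours < 0:
        raise ValueError("duration must be non-negative")
    for upper_bound, label in CATEGORIES:
        if duration_hours <= upper_bound:
            return label


def distribution(durations_hours):
    """Return the percentage of failures falling into each category."""
    counts = Counter(categorise(d) for d in durations_hours)
    total = len(durations_hours)
    if total == 0:
        return {label: 0.0 for _, label in CATEGORIES}
    return {label: 100.0 * counts[label] / total for _, label in CATEGORIES}


# Hypothetical durations in hours (not survey data): 5 min, 10 min, 30 min, 3 h, 48 h.
sample = [5 / 60, 10 / 60, 0.5, 3.0, 48.0]
for label, share in distribution(sample).items():
    print(f"{label}: {share:.1f}%")
```

Applying the same binning to both the questionnaire responses and the operational failure reports is what makes the two distributions directly comparable.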

[Figure 1: two bar charts of the duration variable; vertical axis: Frequency; horizontal axis: duration categories (hours).
a) Questionnaire survey: categories [>24.01], [1.01-24.00], [0.26-1.00], [0.00-0.25]; bar labels 7.23%, 19.15%, 31.06%, 42.55%.
b) Country D operational failure reports: categories [>24.01], [1.01-24], [0.26-1], [0.00-0.25]; bar labels 8.04%, 31.6%, 25.85%, 34.51%.]

Figure 1 Distribution of the duration variable a) from the questionnaire survey; b) from the Country D operational failure reports (see Chapter 4)

4. Importance of the recovery context

When asked about the context surrounding the occurrence of an equipment failure, the controllers acknowledged its importance in the majority of examples (73 percent). Furthermore, these controllers rated its impact mostly as negative (63.9 percent of examples). The negative issues mentioned regarding the context of the equipment failures were reduction of capacity, increased workload, increased stress, increased communication with aircraft, increased coordination with adjacent sectors, and in some cases additional workload due to deterioration in the weather. However, there were several instances in which controllers rated the context as positive, mostly because of efficient teamwork, the availability of an efficient assistant, low traffic levels at the time of occurrence (i.e. no significant increase in workload), and the ability to work with fallback systems. As a result, the importance of context identified in past research is confirmed in this questionnaire survey. The following Chapters are dedicated to further assessment of the recovery context.

5. Existence of training for recovery for a particular failure

Question 5 allowed mapping between ATC functionalities and available recovery training for the sampled equipment failures (see footnote 1). The analysis showed that in 48 percent of the examples provided, the controllers had some type of recovery training. This training was mostly provided for the communication, navigation, surveillance, and data processing functions. A lack of training was identified for power outages and loss of safety nets.

6. Individual who initiated the recovery and method applied

The individuals who initiated and applied recovery processes came predominantly from the controller population, as opposed to watch managers and engineers. This is understandable, as section 2 of this appendix pointed out that most equipment failures are detected by controllers. Having detected a problem with equipment, the controllers have to inform engineers, indirectly through the watch manager, which constitutes the initiation of the recovery. In some simple cases (e.g. loss of microphone and loss of screen), the controller tries to replace the failed equipment either by using a spare or by changing to another working position (if a spare one is available). In more complex situations, when a change of position is not possible, the controller has to continue working with the remaining tools and equipment and potentially revert to procedural control, assure vertical separation, use fallback systems, and/or transfer all flights to an adjacent sector or flight information region. Engineers initiate the recovery process in the case of failures of aeronautical data exchange with adjacent ATC Centres, runway/taxiway lighting systems, and the data processing system. However, the controller still remains responsible for safe separation of all traffic in the affected airspace.

1 Question 26, although intended to capture the type of recovery training missing in each

sampled ATC Centre, yielded mostly high-level comments on the impossibility of training for every potential equipment failure.


7. Concluding remarks

In general, the controllers perceive equipment failures as stressful and distracting

events that pose a major safety problem due to increased workload and difficulties with

maintaining identification of aircraft (e.g. in case of radar failure and data processing

failure). In one particular instance a controller commented that an equipment failure led

to a near miss. Another example pointed out the problems with equipment failures

occurring during the night shift, as technical staff are not always available during that

period.


Appendix VII Overview of contextual factors

Factor | HERA, Eurocontrol [12] | TRACEr, Shorrock and Kirwan [19] | RAFT, Eurocontrol [20] | THERP, Swain and Guttmann [24]: External PSF | THERP: Stressors | THERP: Internal PSF | COCOM, Hollnagel [27] | CREAM, Hollnagel [11]

1 Pilot-controller comm.

Pilot-controller comm.

Pilot-controller comm.

Written and verbal communication

2 Pilot actions

3 Traffic and airspace

Traffic and airspace Task load and system complexity

Complexity; Requirements for perception; requirements for motor speed

Task speed; Task load

4 Weather

5 Documentation and procedures

Procedures Procedures and documentation

Required procedures; Work-methods; Plant policy

Plans Availability of procedures/ plans

6 Training and experience

Training and experience

Training and experience

Prior training, experience

Normal/familiar process state

Adequacy of training and experience

7 Workplace design and HMI

Workplace design, HMI, and equipment factors

Human machine interaction

Design features; Factors in task and work resources; Warnings and danger signs; Man-machine factors; Interface

Inconsistent labelling

MMI and support

Adequacy of MMI and operational support

8 Environment Ambient environment

Quality of environment; T; Air quality; Situational factors

Detractors; Extreme T; radiation; Pressure; Inadequate oxygen supply; Vibration; Restricted movements

Working conditions

9 Personal factors Personal factors Personal factors

Perception; Motor system; Memory; Decision-making; Short-term and long-term memory

Duration of stress; Pain; Thirst; Fatigue; Threats; Monotony; Work performance; Circadian rhythm

State of momentarily abilities personality and intelligence; motivation and attitudes; emotional state; stress; gender

Time of the day (circadian rhythm)

10 Team factors Social and team factors

Social and team factors

Attitudes deriving from family or groups; group dynamic processes

Crew collaboration quality

11 Organisational factors

Organisational factors

Other organisational factors, Logistical factors

Organisational structure; Working hours; Actions by shift leader, manager; Remuneration structure

Adequate organisation


12 Few simultaneous goals

Number of simultaneous goals

13 Suddenness of occurrence

Available time Available time

14


Factor | HRMS, Kirwan [28] | Recovery from Failures, Kanse and van der Schaaf [21] | CORE-DATA, Eurocontrol [13] | ATHEANA, U.S. NRC [29] | CAHR, Straeter [16] | NARA, Kirwan et al. [30] | HPDB, Park et al. [32]

1 Communication

2

3 Task organisation & Task complexity

Task complexity & Task criticality & Task novelty

Task preparation; Task simplicity; Complexity of the task; Precision; Monotony of activity

Dependencies of the different tasks/steps/actions

4

5 Procedures Procedures

Clarity/Precision of procedures; Design of procedures; Content; Completeness; Presence

Shortfalls in the quality of information conveyed by procedures; use of more dangerous procedures

Available procedure & description of all steps and tasks

6 Training/expertise/experience/competence

Person related factors Refresher training & Training

Inexperience

Operator inexperience; Unfamiliarity (situation occurs infrequently)

Level of experience

7 Quality of information/ interface

Technical/workplace/situational factors

Ergonomic design & HMI ambiguous & HMI feedback; Alarms; Labels

Unfamiliar plant conditions

Usability of control; Usability of equipment; Positioning; Equivocation of equipment; arrangement of equipment; display range; accuracy of display; Labelling; Marking; Reliability; Technical layout; Construction; Redundancy; Coupled equipment

Low signal to noise ratio; Overriding information easily accessible; no means to reverse an unintended action; Poor system feedback; Poor system feedback on activity progress

8 Technical/workplace/situational factors

Environmental factors and ergonomics

External event Poor environment

9 Person related factors Stress; Workload

Human performance capabilities at low point; Excessive workload

Processing; Information; Goal reduction

Operator underload/boredom; A conflict between intermediate and long-term objectives; Stress and ill-health; Information overload

Person issues; Demand of perception, cognition, etc.

10 Task organisation Social factors Poor handovers and team coordination problems

Team issues

11 Organisational factors Lack of supervision/checks

Non-optimal use of human resources

Low workforce morale or adverse organisational environment

12

13 Time Factors relevant for prioritisation of recovery-related factors

Time pressure Time constraints Time pressure Time pressure

The time needed to correctly perform tasks, steps, and actions

14 Occurrence-related factors


Appendix VIII Probabilities for 20 Recovery Influencing Factors (RIFs)

The relevant Recovery Influencing Factors (RIFs) are discussed in the four main

groups: internal factors (i.e. related to the controller), equipment failure related factors,

external factors (i.e. factors related to working conditions), and airspace related factors.

The following paragraphs present the underlying considerations in developing the

probability values for each predefined RIF.

A.1 Internal factors

Internal factors represent a group of RIFs closely related to the air traffic controller.

These include quality of training, controller experience with equipment failures in

his/her professional career, experience with (or trust in) the ATC system, generic

assessment of personal factors (e.g. personality, fatigue, stress), and communication

for recovery as a result of detected equipment failure.

A.1.1 Training for recovery from ATC equipment failure

This factor describes the adequacy of training provided in recovery tasks based on the

existing recovery procedures and/or other ATC Centre specific equipment failures,

frequency of refresher training (e.g. once per year), and familiarity with ATC system

operational modes (ranging from full, through reduced/emergency, to failed operation).

The qualitative descriptor and the corresponding probabilities are determined from the

questionnaire survey responses based on percentages of ATC Centres that provide

training for recovery, those that provide this training but not consistently, and those that

do not provide any training for recovery (see Chapter 6, section 6.7.3.6 and Chapter 8,

section 8.3.1.2). The qualitative descriptor and the corresponding probabilities for this

RIF are presented in Table 1.

Table 1 Summary of the RIF ‘Training for recovery from ATC equipment failure’

RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Training for recovery from ATC equipment failure | suitable | The questionnaire survey | 134 | 52 | 0.52 | -
 | tolerable | | | 17 | 0.17 |
 | counter productive | | | 31 | 0.31 |
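The step from survey percentages to RIF probabilities can be illustrated with a minimal Python sketch. The class name `RIF` and its fields are hypothetical and introduced only for illustration; the percentages are those reported in Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RIF:
    """Hypothetical container for one Recovery Influencing Factor."""
    name: str
    level_percentages: dict  # qualitative descriptor -> percentage of responses

    def __post_init__(self):
        # Table 1 percentages sum to 100; reject inputs that do not.
        if sum(self.level_percentages.values()) != 100:
            raise ValueError("percentages must sum to 100")

    def probabilities(self):
        # A probability is simply the percentage expressed as a fraction.
        return {level: pct / 100 for level, pct in self.level_percentages.items()}


training = RIF(
    name="Training for recovery from ATC equipment failure",
    level_percentages={"suitable": 52, "tolerable": 17, "counter productive": 31},
)
probs = training.probabilities()
print(probs)
```

The check in `__post_init__` is a design choice of this sketch, not a step described in the thesis.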


A.1.2 Previous experience with equipment failures

This factor describes the overall level of controller experience with equipment failures,

as well as the level of experience with a particular type of failure under assessment.

The qualitative descriptor is set at two levels (controllers can either have experience

with equipment failures or not), while the probabilities are determined from the

questionnaire survey, further validated by the responses from the ATM specialists

surveyed (Table 2).

Table 2 Summary of the RIF ‘Previous experience with equipment failures’


RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Previous experience with equipment failures | experienced any type of equipment failure | The questionnaire survey | 134 | 95 | 0.95 | ATM specialists surveyed
 | no experience with equipment failures | | | 5 | 0.05 |

A.1.3 Experience with system performance (reliance or trust in the system)

This dynamic factor describes the overall level of experience of the controller with the

ATC system including the tools and subsystems on the ATC console. The use of

automated tools depends upon the controllers’ trust in their reliability. The extreme

situations of undertrust or overtrust may lead to problems. The former may result in the

tool not being used and the latter, in the over reliance of the controller on the tool

available. The probabilities are determined from the findings of the study by Hilburn

and Flynn (2001) also reported in EUROCONTROL (2000b), which involved a total of

79 controllers from seven European ATC Centres. This study used both focus group

discussions and survey data collections to extract controllers’ attitudes to future

automation needs, system development issues, and operational requirements. The

results showed that 18 percent of controllers sampled mistrust technology. On the

other hand, the responses from the ATM specialists surveyed in this thesis reveal that

10 percent of controllers have excessive trust in the system. Taking mistrust and

excessive trust together, the qualitative descriptor for this RIF is set at two levels and

the corresponding probabilities are shown below (Table 3).


Table 3 Summary of the RIF ‘Experience with system performance’


RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Experience with system performance (reliance or trust in the system) | objective attitude toward the ATC system | Past research and ATM specialists | 79/8 | 72 | 0.72 | -
 | excessive trust and mistrust | | | 28 | 0.28 |
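The arithmetic behind the two levels of this RIF can be sketched as follows. The sketch assumes that the 18 percent (mistrust) and 10 percent (excessive trust) figures, which come from two different samples, describe non-overlapping groups and can simply be added; the variable names are illustrative only.

```python
# Percentages quoted in the text for the two unfavourable attitudes.
mistrust_pct = 18          # Hilburn and Flynn (2001), 79 controllers
excessive_trust_pct = 10   # ATM specialists surveyed in this thesis

# Assumption of this sketch: the two groups do not overlap.
non_objective_pct = mistrust_pct + excessive_trust_pct
objective_pct = 100 - non_objective_pct

probabilities = {
    "objective attitude toward the ATC system": objective_pct / 100,
    "excessive trust and mistrust": non_objective_pct / 100,
}
print(probabilities)
```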

A.1.4 Personal factors

These are controller-related factors, which can be determined in a post-failure analysis

or predicted in the case of predictive analysis. This factor includes, but is not limited

to, the following: time of the day (i.e. relevance of circadian rhythm), time into the shift

(i.e. level of situational awareness as well as fatigue), and age. Although other factors

are important, for example, the level of confidence, complacency, self-esteem (i.e. trust

in own ability), personality, motivation, attitudes deriving from family or close social

groups, and ability to cope with stress, they require the application of various sets of

psychological tests. The current definition of the personal factors accounts for all the above

mentioned factors and sets the qualitative descriptor at three levels. The respective

probabilities are determined from the average of the responses from the ATM

specialists surveyed (Table 4).

Table 4 Summary of the RIF ‘Personal factors’


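RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Personal factors | suitable | ATM specialists | 8 | 65 | 0.65 | -
 | tolerable | | | 26 | 0.26 |
 | counter productive | | | 9 | 0.09 |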
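Several RIF probabilities in this appendix are described as the average of the responses from the ATM specialists surveyed. A minimal sketch of such averaging is given below. The individual responses shown are hypothetical placeholders (the thesis reports only the averaged values), and the function name is introduced for illustration.

```python
LEVELS = ("suitable", "tolerable", "counter productive")

# HYPOTHETICAL data: each row is one specialist's percentage split
# across the three levels. These are not values from the survey.
responses = [
    (80, 15, 5),
    (70, 20, 10),
    (75, 20, 5),
]


def average_percentages(rows):
    """Average each level's percentage across all respondents."""
    for row in rows:
        if sum(row) != 100:
            raise ValueError("each response must sum to 100 percent")
    n = len(rows)
    return {level: sum(row[i] for row in rows) / n for i, level in enumerate(LEVELS)}


mean_pct = average_percentages(responses)
# Express the averaged percentages as probabilities rounded to two decimals.
probabilities = {level: round(pct / 100, 2) for level, pct in mean_pct.items()}
print(probabilities)
```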

A.1.5 Communication for recovery within team/ATC Centre

This factor includes only the communication that takes place between controllers for

the purpose of recovery from equipment failure. Therefore, it assesses the quality of

communication as well as the decision-making process, quality of Team Resource


Management (TRM)², familiarity of team members or the level of synergy between

them, the level of mutual understanding and the knowledge of different working

strategies, team efficacy, intent recognition (i.e. overt communication), and other items.

In the case of a single-controller position this factor should be understood as

communication with a supervisor or any other relevant personnel. The qualitative

descriptor is proposed at three levels while the corresponding probabilities are

determined from the average of the responses from the ATM specialists surveyed

(Table 5).

Table 5 Summary of the RIF ‘Communication for recovery within team/ATC Centre’


RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Communication for recovery within team/ATC Centre | efficient | ATM specialists | 8 | 73 | 0.73 | -
 | tolerable | | | 24 | 0.24 |
 | inefficient | | | 4 | 0.04 |
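Because the averaged responses are rounded to whole percentages, the published values for some RIFs sum to 101 rather than 100 (Tables 5, 7, 12, and 16). If these values are to be used as a probability distribution, one option is to rescale them so that they sum to one. The sketch below applies this to the Table 5 values; the rescaling step is an assumption of this sketch and is not described in the thesis.

```python
# Published RIF probabilities from Table 5.
table5 = {"efficient": 0.73, "tolerable": 0.24, "inefficient": 0.04}

total = sum(table5.values())  # about 1.01 because of rounding

# Rescale so the levels form a proper probability distribution.
normalised = {level: p / total for level, p in table5.items()}

for level, p in normalised.items():
    print(f"{level}: {p:.4f}")
```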

A.2 Equipment failure related factors

Equipment failure related factors represent a group of RIFs defining the characteristics

of failures relevant to the controller recovery process. These are complexity of failure

type, time course of failure development, number of workstations/sectors affected, time

necessary to recover, existence of recovery procedure, and duration of failure. Details

on failure characteristics can be found in Chapter 4.

A.2.1 Complexity of failure type

This factor identifies single versus multiple component failures (as discussed in

Chapter 4) and thus the qualitative descriptor is proposed at two levels. The

probabilities of each level are determined using the operational failure reports from

available Civil Aviation Authorities (Table 6). Due to the relatively low level of

confidence in the use of CAA occurrence databases (see Chapter 8, section 8.3.1.5),

these probabilities were validated by the responses from the ATM specialists surveyed

which did not show a significant difference. Additionally, these results are in line with

the experience of system control and monitoring engineers interviewed for this study

2 TRM represents an effective use of all available resources for ATC personnel to assure safe

and efficient operation, to reduce error, avoid stress, and increase efficiency.


who stated that the majority of ATC equipment failures are single rather than

multiple failures (for evidence see Appendix II).

Table 6 Summary of the RIF ‘Complexity of failure type’


RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Complexity of failure type | a single failure | Operational failure reports | 22,808 reports | 92 | 0.92 | ATM specialists responses and system control and monitoring engineers
 | multiple failure | | | 8 | 0.08 |

A.2.2 Time course of failure development

This factor defines the temporal characteristics of failure occurrence. These are

sudden, gradual, and latent/persistent failures. As a result, the qualitative descriptor is

set at three levels: sudden failure/gradual degradation of system/persistent or latent

failure. Based on the averaged responses from the ATM specialists surveyed the

corresponding probabilities are presented in Table 7. These probabilities were

validated by the interviews with system control and monitoring staff from several ATC

Centres which did not show a significant difference (for evidence see Appendix II).

Table 7 Summary of the RIF ‘Time course of failure development’


RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Time course of failure development | sudden | ATM specialists responses | 8 | 55 | 0.55 | System control and monitoring engineers
 | gradual | | | 39 | 0.39 |
 | latent | | | 7 | 0.07 |

A.2.3 Number of workstations/sectors affected

This factor describes the immediate impact of a particular type of failure in terms of the

number of positions/sectors affected. It is closely linked to the overall ATC Centre

architecture, since exposure to failure varies greatly with the level of interconnectivity of

different systems, the level of availability of separate channels (redundancy/variability),

and complexity of failure (single vs. multiple failure). The qualitative descriptor is

proposed at two levels, differentiating between a failure affecting a single and multiple


Controller Working Positions (CWPs) and sectors. Due to the lack of operational data,

a conservative approach is taken and probabilities are equally assigned between two

levels. Note that this RIF has no Level 1, i.e. the most favourable level, simply because

the number of workstations/sectors affected cannot have any positive or favourable

effect on controller performance (Table 8).

Table 8 Summary of the RIF ‘Number of workstations/sectors affected’


RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Number of workstations/sectors affected | one CWP or several CWPs in a sector | N/A | | 50 | 0.5 | -
 | several CWPs in several sectors/all CWPs in all sectors | | | 50 | 0.5 |

A.2.4 Time necessary to recover

This factor describes the time necessary for a controller to recover from the effect(s) of

equipment failure. This time should be measured from the moment of failure

occurrence until the establishment of a normal or stable system state (i.e. assurance of

safe but not necessarily efficient control of air traffic). The qualitative descriptor is set at

two levels, differentiating between availability and lack of time to recover, while the

corresponding probabilities are determined from the average of the responses from the

ATM specialists surveyed (Table 9).

Table 9 Summary of the RIF ‘Time necessary to recover’


RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Time necessary to recover | less than time available³ | ATM specialists | 8 | 94 | 0.94 | -
 | in excess of time available | | | 6 | 0.06 |

3 Time available to controller to react before the development of less than adequate separation.
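The two levels of this RIF amount to a comparison between the time needed to reach a stable system state and the time available before separation becomes less than adequate. A minimal sketch of that comparison is given below. The function name, the use of seconds, and the treatment of the boundary case (equal times counted as 'in excess of time available') are assumptions of this sketch.

```python
def time_to_recover_level(time_to_recover_s: float, time_available_s: float) -> str:
    """Return the qualitative level of the RIF 'Time necessary to recover'.

    time_to_recover_s: time from failure occurrence to a stable system state.
    time_available_s: time before less than adequate separation develops.
    """
    if time_to_recover_s < 0 or time_available_s < 0:
        raise ValueError("times must be non-negative")
    # Boundary choice (assumption): recovery must finish strictly before
    # the available time runs out to count as the favourable level.
    if time_to_recover_s < time_available_s:
        return "less than time available"
    return "in excess of time available"


print(time_to_recover_level(90, 240))
print(time_to_recover_level(300, 240))
```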


A.2.5 Existence of recovery procedure

This factor takes into account the availability of a written procedure, rules, or guidelines

for a particular type of equipment failure, the level of its comprehensiveness and

completeness. In future this RIF may even include the existence of some sort of a

dynamically adaptable procedure. The qualitative descriptor is set at three levels to

capture the quality of the existing procedure (Table 10). Probabilities are calculated

based on the findings from the questionnaire survey responses which showed that 13.8

percent of ATC Centres do not have any recovery procedures. The distinction between

suitable and tolerable procedures was acquired taking into account that 45 percent of

existing procedures are not complete, and therefore only tolerable. It should be noted

that this approach is limited as it associates incomplete procedures with tolerable

procedures. A more accurate approach is achievable when the proposed methodology

is applied to a specific equipment failure and its context.

Table 10 Summary of the RIF ‘Existence of recovery procedure’


RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Existence of recovery procedure | suitable | The questionnaire survey | 134 | 47 | 0.47 | -
 | tolerable | | | 39 | 0.39 |
 | inappropriate⁴ | | | 14 | 0.14 |
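One reading of the calculation described above, which reproduces the values in Table 10, is sketched below: the 13.8 percent of ATC Centres without any procedure form the 'inappropriate' level, and the remaining share is split 45/55 between 'tolerable' (incomplete) and 'suitable' procedures. This reconstruction is an assumption; the variable names are illustrative.

```python
no_procedure = 0.138       # share of ATC Centres with no recovery procedure
incomplete_share = 0.45    # share of existing procedures that are incomplete

has_procedure = 1 - no_procedure

levels = {
    "suitable": (1 - incomplete_share) * has_procedure,   # complete procedures
    "tolerable": incomplete_share * has_procedure,        # incomplete procedures
    "inappropriate": no_procedure,                        # no procedure at all
}

# Round to two decimals, as in Table 10.
rounded = {level: round(p, 2) for level, p in levels.items()}
print(rounded)
```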

A.2.6 Duration of failure

This particular factor represents the amount of time during which a failure persists.

Applied to a specific system, it can carry important information on recovery and the

impact of a particular failure on ATC and overall aviation safety. The qualitative

descriptor, proposed at two levels, is based on the analysis of failure durations in

the operational failure reports. The corresponding

probabilities are determined from the operational failure reports (Chapter 4), further

validated by the responses from the ATM specialists surveyed which did not show a

significant difference (Table 11).

4 If procedures are not available, ‘Inappropriate’ would be used.


Table 11 Summary of the RIF ‘Duration of failure’


RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Duration of failure | short period of time (up to 15 minutes) | Operational failure reports | 22,808 (reports) | 56 | 0.56 | ATM specialists surveyed
 | moderate to substantial period of time (failures longer than 15 minutes) | | | 44 | 0.44 |

A.3 External factors

External factors represent the group of RIFs related to the working conditions

surrounding a controller at the moment of failure.

These are adequacy of HMI, operational support, quality of alarms/alerts and the

moment when they are triggered in the system, and the overall adequacy of the

organisational characteristics in an ATC Centre from the safety and operational

perspectives.

A.3.1 Adequacy of HMI and operational support

This factor includes the HMI and all available control panels (e.g. mode of operation,

radars in use, frequencies in use and dynamic flight information), situational display, as

well as the operational support provided by specifically designed decision aids. It is

important to highlight that a controller receives all feedback on the ATM system

performance through the HMI. The qualitative descriptor is set at three levels to capture

the quality of the HMI, while the probabilities are determined from the average of the

responses from the ATM specialists surveyed (Table 12).

Table 12 Summary of the RIF ‘Adequacy of HMI and operational support’


RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Adequacy of HMI and operational support | suitable | ATM specialists | 8 | 53 | 0.53 | -
 | tolerable | | | 45 | 0.45 |
 | counter productive | | | 3 | 0.03 |

A.3.2 Ambiguity of information in the working environment

This dynamic factor describes the transparency of the system, the level of system

interaction and redundancy, and existence of symptoms that can be interpreted in more


than one way. In general, it is observed that a lack of transparency of an ATC system

leads people to make hypotheses on the causes of failures based on incomplete

information or best guess (see Straeter, 2005). ATC subsystems are highly dependent

on each other. Information from one tool can be distributed to several different

subsystems at the same time. For example, information on aircraft position is sent

directly to the radar data processing system, air traffic flow management, ATC tools

(including the monitoring aid and the medium term conflict detection tool), safety nets

(e.g. the short term conflict alert tool), and flight data processing system. In other

words, ATC systems are closely coupled and dependent upon dynamic information

exchange. For this reason the architecture of any ATC Centre takes into account

existing interactions by building a net of redundancies. In addition, any symptoms that

can be interpreted in more than one way will be interpreted wrongly in some instances.

Based on the above discussion, the qualitative descriptor is set at two levels whilst

the corresponding probabilities are determined from the average of the responses from

the ATM specialists surveyed (Table 13).

Table 13 Summary of the RIF ‘Ambiguity of information in the working environment’


RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Ambiguity of information in the working environment | the match between the external working environment and the controller's internal mental model | ATM specialists | 8 | 86 | 0.86 | -
 | the mismatch between the external working environment and the controller's internal mental model | | | 14 | 0.14 |

A.3.3 Adequacy of alarms/alerts

As explained in Chapter 4, the function of alarms/alerts is to alert operators (visually

and/or audibly) to potential non-nominal system states. The role of the human

operator is then to confirm the existence of a failure and take appropriate actions.

Because of the complexity of current ATC consoles, it is believed that the availability,

adequacy of alerts, and other relevant characteristics should be considered separately

from HMI. Therefore, this factor describes the availability and adequacy of


alarms/alerts which permit detection, diagnosis, and/or correction of failures, the

reliability of given information, the number of alerts presented to the controller, and the

appropriate location and format of alert information (e.g. signal, colour coding,

warning/message). The qualitative descriptor is set at three levels, to account for

suitable, tolerable, and inadequate design solutions, while the probabilities are

determined from the average of the responses from the ATM specialists surveyed

(Table 14).

Table 14 Summary of the RIF ‘Adequacy of alarms/alerts’


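RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Adequacy of alarms/alerts | suitable | ATM specialists | 8 | 75 | 0.75 | -
 | tolerable | | | 20 | 0.2 |
 | counter productive | | | 5 | 0.05 |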

A.3.4 Adequacy of alarm/alert onset

This dynamic factor describes one important characteristic of the available

alerts/alarms, namely the ‘cognitive convenience’ of alert onset. In other words, alert

onset has a high impact on the overall recovery performance depending on the

moment of its onset. In addition, a misleading sequence of alerts can lead the controller

towards wrong assumptions through cognitive tunnelling on the initial alert,

thereby disregarding a later, possibly more relevant alert (Straeter, 2005). Since the

adequacy of alert onset depends directly on the complexity of traffic in the dedicated

airspace (dynamically changing every second), this RIF is given two levels.

Furthermore, due to the lack of ATC operational data on this advanced and futuristic

concept, a conservative approach is taken and probabilities are equally assigned

between two levels (Table 15).


Table 15 Summary of the RIF ‘Adequacy of alarm/alert onset’


RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Adequacy of alarm/alert onset | information from the external world enters the processing loop at the right time | N/A | N/A | 50 | 0.50 | -
 | information from the external world enters the processing loop at the wrong time, i.e. misleading alarm or sequence of alarms | | | 50 | 0.50 |

A.3.5 Adequacy of organisation

This factor describes several organisational characteristics of the ATC Centre. These

include but are not limited to the quality of roles and responsibilities, the availability of

team members, the availability and adequacy of supervision, the availability of

additional support (e.g. assistant), the personnel selection process, shift patterns and

personnel planning, attitude to teamwork, safety culture, existence of stress

management programs, support for the organised exchange of past experience on

equipment failures, and adequacy of communication with management and technicians

(e.g. briefings, exchange of knowledge, bulletins, safety panels). Three qualitative

descriptors can be distinguished with probabilities determined from the average of the

responses from the ATM specialists surveyed (Table 16).

Table 16 Summary of the RIF ‘Adequacy of organisation’


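RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Adequacy of organisation | efficient | ATM specialists | 8 | 67 | 0.67 | -
 | tolerable | | | 31 | 0.31 |
 | inefficient | | | 3 | 0.03 |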

A.4 Airspace related factors

Airspace related factors relate to the characteristics of the airspace affected by the

degraded system performance, traffic complexity at the moment of failure and during

the recovery process, and weather conditions. In addition, this group includes the

overall task complexity of the situation. For example, an equipment failure occurrence

coupled with a sudden increase in the amount of traffic, a sudden deterioration of weather, or

the existence of priority aircraft greatly increases the complexity of the overall situation.


A.4.1 Traffic complexity during the recovery process

This dynamic factor includes but is not limited to the following: the level and

characteristics of the traffic load, the mix of aircraft flying on instrument flight rules (IFR)

and visual flight rules (VFR), military aircraft (because of different performance

characteristics and speed differentials), the existence of priority aircraft (e.g. low fuel,

government flights, and medical emergency). There have been various studies into

traffic complexity (Hilburn, 2004) and various attempts to provide a quantitative

indicator of traffic complexity; for example using dynamic density (Kopardekar and

Magyarits, 2003), cross-sectional time-series analysis methods (Majumdar et al., 2004),

and the use of a traffic complexity indicator (EUROCONTROL, 2006c). Any of these

approaches may be used to inform the probabilities for the qualitative descriptor of this

particular RIF. Taking into account only the impact that traffic complexity may have on

the controller performance, this qualitative descriptor is proposed at two levels. One

level accounts for average traffic complexity whilst the other accounts for high and low

traffic complexity, as both negatively impact controller performance. The probabilities

are determined from the average of the responses from the ATM specialists surveyed

(Table 17).

Table 17 Summary of the RIF ‘Traffic complexity during the recovery process’



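RIF | Qualitative descriptor | Data source for probabilistic assessment | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Traffic complexity during the recovery process | High and low traffic complexity | ATM specialists | 8 | 19 | 0.19 | -
 | Average traffic complexity | | | 81 | 0.81 |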

A.4.2 Airspace characteristics during the recovery process

This dynamic factor incorporates the characteristics and complexity of airspace (i.e. its

component sectors), based upon the sector design characteristics (for details see

NATS, 1999). These characteristics include the number of crossing points and their

position in relation to sector boundaries, number of flight levels, number of entry and

exit points, special use airspace (SUAs) including zones of military activity,

characteristics of upper vs. lower airspace, airways configuration, and the number of

neighbouring sectors. It is important to highlight the difference between en-route and

terminal airspace in relation to recovery from equipment failures. The terminal airspace

is characterised by traffic in constant level change (i.e. ascending or descending) and


frequent changes in heading compared to en-route airspace and especially its higher

levels. Due to differences in controller tasks, en-route airspace in general provides

more time to recover than terminal airspace. In addition, interviews with ATM

specialists revealed that terminal airspace has radar coverage provided by a single

radar source, whereas en-route airspace is usually covered by multi-radar

tracking (i.e. integration of data from several radar sites). The qualitative descriptor is

set at three levels whilst the corresponding probabilities are determined from the

average of the responses from the ATM specialists surveyed (Table 18).

Table 18 Summary of the RIF ‘Airspace characteristics during the recovery process’


Airspace characteristics during the recovery process | Data source | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Adequate | ATM specialists | 8 | 64 | 0.64 | -
Tolerable | ATM specialists | 8 | 33 | 0.33 | -
Inappropriate | ATM specialists | 8 | 3 | 0.03 | -

A.4.3 Weather conditions during the recovery process

This dynamic factor takes into account any change in weather conditions during the

recovery process. The qualitative descriptor is proposed at two levels whilst the

corresponding probabilities are determined from the responses from the ATM

specialists surveyed (Table 19).

Table 19 Summary of the RIF ‘Weather conditions during the recovery process’



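Weather conditions during the recovery process | Data source | Number of responses | Percentage of responses | RIF probability | Nature of the validation
Improved | ATM specialists | 8 | 89 | 0.89 | -
Deteriorated | ATM specialists | 8 | 11 | 0.11 | -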

A.4.4 Conflicting issues during the recovery process (task complexity)

This dynamic factor describes the level of overall task complexity at the moment of

equipment failure. In the case of multiple conflicting tasks, the operator has to prioritise

between them (Straeter, 2005). In the case of any type of conflict alert (i.e. two or more

aircraft having a conflicting intent), the controller has to give full attention to the

resolution of the conflict using the equipment which is still operational, but assuming

that some other subsystem might fail. In ATC, overall safety is the first priority. Due to

the dynamic nature of ATC, this qualitative descriptor is proposed at two levels: the

average complexity of the situation, and both high and low complexity of the situation

(as both have a negative effect on controller performance: increased workload and

boredom or monotony, respectively). The corresponding probabilities are determined

from the responses from the ATM specialists surveyed (Table 20).

Table 20 Summary of the RIF ‘Conflicting issues during the recovery process (overall task complexity)’


Conflicting issues during the recovery process | Data source | Number of responses | Percentage of responses | RIF probability | Nature of the validation
The average complexity | ATM specialists | 8 | 72 | 0.72 | -
Multiple tasks and low complexity | ATM specialists | 8 | 28 | 0.28 | -

Appendix IX Questions for ATM Specialist

Note: The set of questions presented below investigates controller recovery from

equipment failures in ATC. All questions should be answered based upon your

operational experience and knowledge. Whilst some of them are very specific, and

therefore pose a challenge to answer, please try to respond to all the questions giving

the appropriate percentages.

How often has training (initial & refresher) in your ATC Centre been:

• Suitable for potential equipment failures
• Tolerable for potential equipment failures
• Counter productive for potential equipment failures

(Total: 100%)

What is the percentage of ATCOs that have never experienced equipment failure in their career? Please think of novice ATCOs as well and try to make the best estimation.

According to your best judgement, what percentage of ATCOs:

• Over-trust the automation/systems they are using
• Have an objective attitude toward ATC automation (ATCOs do trust automation but are aware of possible failures)
• Under-trust the automation/systems they are using

(Total: 100%)

In the event of equipment failure, how often have personal factors (stress, fatigue, self esteem) been:

• Suitable to the equipment failure in question
• Tolerable to the equipment failure in question
• Counter productive to the equipment failure in question

(Total: 100%)

How often has team-related communication for recovery been:

• Efficient
• Tolerable
• Inefficient

(Total: 100%)

What is the percentage of equipment failures affecting:

• One system only
• Multiple systems at the same time

(Total: 100%)

What is the percentage of the following in your ATC Centre:

• Sudden equipment failures
• Gradual equipment failures
• Latent equipment failures

(Total: 100%)

How often has the time necessary to recover (time before the development of any inadequate separation) been:

• Adequate
• Inadequate

(Total: 100%)

How often (in your overall experience) have existing recovery procedures been:

• Suitable to the equipment failure in question
• Tolerable to the equipment failure in question
• Counter productive to the equipment failure in question

(Total: 100%)

What is the percentage of equipment failures lasting:

• Up to 15 min
• More than 15 min

(Total: 100%)

When there is a failure, how often has information presented on your HMI (i.e. radar screen) been:

• Suitable to the recovery from equipment failure (e.g. provides appropriate cues, visual/auditory alerts)
• Tolerable to the recovery from equipment failure
• Counter productive to the recovery from equipment failure (e.g. provides wrong cues, misleads you)

(Total: 100%)

When there is a failure, how often have existing alarms/alerts on the radar screen been:

• Suitable to the recovery from equipment failure
• Tolerable to the recovery from equipment failure
• Counter productive to the recovery from equipment failure

(Total: 100%)

In your opinion, what is the percentage of match between the controller's situational awareness and the dynamic airspace and traffic configuration (traffic mix, speed differentials, FL utilized, airways configuration) during the recovery process?

What percentage of the time are the organisational features in your ATC Centre, regarding the support for better recovery from equipment failures:

• Efficient
• Tolerable
• Inefficient

(Total: 100%)

In the event of an equipment failure, how often has the traffic complexity been:

• Too high
• Tolerable
• Too low

(Total: 100%)

In the event of an equipment failure, how often has airspace design and configuration been:

• Adequate
• Tolerable
• Inappropriate

(Total: 100%)

In the event of an equipment failure, how often have the weather conditions been:

• Improved
• Deteriorated or worsened
• Unchanged

(Total: 100%)

In the event of equipment failure, how often has the total complexity of the recovery situation been:

• High
• Average
• Low

(Total: 100%)
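
Each multi-option question above asks for shares that total 100%. A minimal sketch of how one completed questionnaire might be checked and converted to probabilities before any averaging is given below. The dictionary layout, the question keys, the tolerance, and the function name `to_probabilities` are all hypothetical and are not part of the survey procedure described in this thesis.

```python
def to_probabilities(answers, tolerance=0.5):
    """Convert percentage answers to probabilities, rejecting bad totals.

    `answers` maps a question key to a list of percentages, one per option.
    """
    result = {}
    for question, shares in answers.items():
        total = sum(shares)
        if abs(total - 100) > tolerance:
            raise ValueError(f"{question}: shares total {total}%, expected 100%")
        result[question] = [share / 100 for share in shares]
    return result


# Hypothetical answers from one specialist (illustrative values only).
specialist = {
    "training": [60, 30, 10],      # suitable, tolerable, counter productive
    "failure_duration": [70, 30],  # up to 15 min, more than 15 min
}
converted = to_probabilities(specialist)
print(converted)
```

Rejecting a response outright, rather than rescaling it, is one possible design choice; it keeps the decision about how to treat inconsistent answers with the analyst.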

Appendix X Overview of RIFs, their corresponding levels, and designated probabilities

(1) ID | (2) RIF name | (3) Descriptor | (4) Probability (p) | (5) Expected effect of controller recovery performance | (6) Level | (7) Designator (R) | (8) Probability of overall situation occurring (p*R)

Internal factors
1 | Training for recovery from ATC equipment failure | Suitable to the situation in question | 0.52 | Most favourable | 1 | 1 | 0.52
 | | Tolerable to the situation in question | 0.17 | Non significant | 2 | 0 | 0.00
 | | Counter productive to the situation in question | 0.31 | Least favourable | 3 | -1 | -0.31
2 | Previous experience with equipment failures | Experienced with a particular type of failure or Experienced with any other type of ATC equipment failure | 0.95 | Most favourable | 1 | 1 | 0.95
 | | No experience with ATC equipment failures | 0.05 | Non significant | | |
3 | Experience with the system performance (reliance) | Objective attitude toward the system | 0.72 | Non significant | | |
 | | Positive experience with the system (excessive trust) or Negative experience with the system (under-trust) | 0.28 | Least favourable | | |
4 | Personal factors | Suitable for the recovery process | 0.65 | Most favourable | 1 | 1 | 0.65
 | | Tolerable for the recovery process | 0.26 | Non significant | | |
 | | Counter productive for the recovery process | 0.09 | Least favourable | | |
5 | Communication for recovery within team/ATC Centre | Efficient | 0.73 | Most favourable | 1 | 1 | 0.73
 | | Tolerable | 0.24 | Non significant | | |
 | | Inefficient | 0.04 | Least favourable | | |

Equipment failure related factors
6 | Complexity of failure | Single system affected | 0.92 | Non significant | | |
 | | | 0.08 | Least favourable | | |
7 | Time course of failure development | Sudden failure | 0.55 | Improve | 1 | 1 | 0.55
 | | Persistent or latent failure | 0.07 | Non significant | | |
 | | Gradual degradation of system | 0.39 | Least favourable | | |
8 | Number of workstations/sectors affected | One workstation/one sector or All workstations in one sector | 0.50 | Non significant | | |
 | | Several workstations/couple of sectors or All workstations/all sectors | 0.50 | Least favourable | | |
9 | Time necessary to recover | Adequate - less than available time | 0.94 | Most favourable | 1 | 1 | 0.94
 | | Inadequate - in excess of available time | 0.06 | Least favourable | | |
10 | Existence of recovery procedure | | 0.47 | Most favourable | 1 | 1 | 0.47
 | | | 0.39 | Non significant | | |
 | | Inappropriate | 0.14 | Least favourable | | |
11 | Duration of failure | Short period of time | 0.56 | Non significant | | |
 | | Moderate period of time or Substantial period of time | 0.44 | Least favourable | | |

External or factors related to working conditions
12 | Adequacy of HMI and operational support | | 0.53 | Most favourable | 1 | 1 | 0.53
 | | | 0.45 | Non significant | | |
 | | | 0.03 | Least favourable | | |
13 | Ambiguity of information | | 0.86 | Most favourable | 1 | 1 | 0.86
 | | | 0.14 | Least favourable | | |
14 | Adequacy of alarms/alerts | | 0.75 | Most favourable | 1 | 1 | 0.75
 | | | 0.20 | Non significant | | |
 | | | 0.05 | Least favourable | | |
15 | Adequacy of alarm/alert onset | Information from the external world enters the processing loop at the right time | 0.50 | Most favourable | 1 | 1 | 0.50
 | | Information from the external world enters the processing loop at the wrong time (misleading sequence of alarms) | 0.50 | Least favourable | | |
16 | Adequacy of organisation | Efficient | 0.67 | Most favourable | 1 | 1 | 0.67
 | | Tolerable | 0.31 | Non significant | 2 | 0 | 0.00
 | | Inefficient | 0.03 | Least favourable | 3 | -1 | -0.03

Airspace related factors
17 | Traffic complexity | Average traffic complexity | 0.81 | Non significant | 2 | 0 | 0.00
 | | Extremely high or extremely low traffic complexity | 0.19 | Least favourable | 3 | -1 | -0.19
18 | Airspace characteristics | Adequate (e.g. enroute higher levels) | 0.64 | Most favourable | 1 | 1 | 0.64
 | | Tolerable | 0.33 | Non significant | 2 | 0 | 0.00
 | | Inappropriate (e.g. enroute lower levels or terminal) | 0.03 | Least favourable | 3 | -1 | -0.03
19 | Weather conditions during the recovery process | Improved | 0.89 | Non significant | 2 | 0 | 0.00
 | | Deteriorated | 0.11 | Least favourable | 3 | -1 | -0.11
20 | Conflicting issues in the situation (task complexity) | Average complexity of the situation | 0.72 | Non significant | 2 | 0 | 0.00
 | | Conflicting, multiple tasks or Extremely low complexity of the situation (may lead to monotony) | 0.28 | Least favourable | 3 | -1 | -0.28
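
Columns (4) to (8) of the table can be read as a simple weighting: each descriptor level carries a designator R (1, 0 or -1 in the rows that list it), and column (8) is the product p*R. The sketch below reproduces that arithmetic for RIF 1 and RIF 16 using the probabilities listed in the table. The names `DESIGNATOR` and `weighted_probability` are hypothetical, and rounding to two decimals is an assumption made to match the precision shown in the table.

```python
# Designator R for each expected-effect category, as read from the table rows.
DESIGNATOR = {
    "most favourable": 1,
    "non significant": 0,
    "least favourable": -1,
}


def weighted_probability(p, effect):
    """Return p*R, rounded to two decimals as in column (8)."""
    return round(p * DESIGNATOR[effect], 2)


# Probabilities from the table for RIF 1 (training) and RIF 16 (organisation).
rif_levels = {
    1: [(0.52, "most favourable"), (0.17, "non significant"),
        (0.31, "least favourable")],
    16: [(0.67, "most favourable"), (0.31, "non significant"),
         (0.03, "least favourable")],
}

weighted = {
    rif: [weighted_probability(p, effect) for p, effect in levels]
    for rif, levels in rif_levels.items()
}
print(weighted)
```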

Appendix XI Validation of the RIFs interaction matrix

DIRECT INFLUENCE

Column headings (left to right): Training for recovery; Previous experience with equip. failures; Experience with system performance; Personal factors; Comm. for recovery; Complexity of failure; Time course of failure development; Number of workstations affected; Time necessary to recover; Existence of recovery procedure; Duration of failure; Adequacy of HMI and oper. support; Ambiguity of information; Adequacy of alarms/alerts; Adequacy of alarms/alerts onset; Adequacy of organization; Traffic/traffic complexity; Airspace characteristics; Weather; Task complexity

Training for recovery from ATC equipment failures

x x

Previous experience with equip. failures

x

Experience with system performance (reliance)

x x x x

Personal factors

x x x x x x x x x x x x x x x x x x

Comm. for recovery within a team of controllers

x x x x x x x x x x x x x x x x x x


x


x

Number of workstations/ sectors affected

x x

Time necessary to recover

x x x x x x x x x x x x x x x x x


x

Duration of failure

x x


x x x x


x x x x x x x


x x x

Adequacy of alarms/alerts onset

x x x x x

Adequacy of organization

x x x x

Traffic/traffic complexity in the moment of failure

x x x

Airspace characteristics

x x x


Task complexity

x x x x x x x x x x x x x x x x x

NOTE: Please mark the interactions between each factor in the upper row and each factor from the left column. For example, does 'Training for recovery' influence any of the factors from the left side ('previous experience', 'experience with the system', 'personal factors', and so on)? Please add or delete existing interactions as you find appropriate.

Appendix XII Distribution of 20 Recovery Influencing Factors (RIFs)

Level RIF1 RIF2 RIF3 RIF4 RIF5 RIF6 RIF7 RIF8 RIF9 RIF10

0.1 0 0 0 0 0 0 0 0 0 0

0.2 0 0 0 0 0 0 0 0 0 0

0.3 0 0 0 0 0 0 0 0 0 0

0.4 0 0 0 0 0 0 0 0 0 0

0.5 0 0 0 168 24 0 0 0 96 0

0.6 0 0 0 5964 2244 0 0 0 4272 0

0.7 0 0 0 67956 37908 0 0 0 58656 0

0.8 0 0 0 379116 266508 0 0 0 383184 0

0.9 2239488 0 0 1227984 1008576 0 0 0 1422000 0

1 8957952 13436928 0 2513604 2310156 0 8957952 0 3279840 8957952

1.1 2239488 6718464 0 1653636 1621692 0 4478976 0 2337228 4478976

1.2 0 0 0 3393708 3512088 0 0 0 5184840 0

1.3 0 0 0 2513604 2750052 0 0 0 4234404 0

1.4 0 0 0 1227984 1398444 0 0 0 2283432 0

1.5 0 0 0 379284 442464 0 0 0 786768 0

1.6 0 0 0 73920 82008 0 0 0 162216 0

1.7 0 0 0 73920 44760 0 0 0 17670 0

1.8 0 0 248832 379284 266688 0 0 0 780 0

1.9 2239488 0 3483648 1227984 1008576 0 0 0 6 0

2 8957952 13436928 8709120 2513604 2310156 13436928 8957952 10077696 0 8957952

2.1 2239488 6718464 3981312 1653636 1621692 6718464 4478976 6718464 0 4478976

2.2 0 0 3483648 3393708 3512088 0 0 3359232 0 0

2.3 0 0 248832 2513604 2750052 0 0 0 0 0

2.4 0 0 0 1227984 1398444 0 0 0 0 0

2.5 0 0 0 379284 442464 0 0 0 96 0

2.6 0 0 0 73920 82008 0 0 0 4272 0

2.7 0 0 0 73920 44760 0 0 0 58656 0

2.8 0 0 248832 379284 266688 0 0 0 383184 0

2.9 2239488 0 3483648 1227984 1008576 0 0 0 1422000 0

3 8957952 0 8709120 2513604 2310156 13436928 8957952 10077696 3279840 8957952

3.1 2239488 0 3981312 1653636 1621692 6718464 4478976 6718464 2337228 4478976

3.2 0 0 3483648 3393708 3512088 0 0 3359232 5184840 0

3.3 0 0 248832 2513604 2750052 0 0 0 4234404 0

3.4 0 0 0 1227984 1398444 0 0 0 2283432 0

3.5 0 0 0 379116 442440 0 0 0 786768 0

3.6 0 0 0 67956 79764 0 0 0 162216 0

3.7 0 0 0 5964 6852 0 0 0 17670 0

3.8 0 0 0 168 180 0 0 0 780 0

3.9 0 0 0 0 0 0 0 0 6 0

4 0 0 0 0 0 0 0 0 0 0

Level RIF11 RIF12 RIF13 RIF14 RIF15 RIF16 RIF17 RIF18 RIF19 RIF20

0.1 0 0 0 0 0 0 0 0 0 0

0.2 0 0 0 0 0 0 0 0 0 0

0.3 0 0 0 0 0 0 0 0 0 0

0.4 0 0 0 0 0 0 0 0 0 0

0.5 0 0 0 0 0 0 0 0 0 0

0.6 0 0 0 0 0 0 0 0 0 0

0.7 0 0 20736 0 0 0 0 0 0 0

0.8 0 248832 684288 0 124416 0 0 0 0 0

0.9 0 2488320 3836160 746496 2363904 1492992 0 0 0 0

1 0 5474304 7527168 5971968 7589376 5225472 0 6718464 0 0

1.1 0 2488320 3545856 3732480 4354560 2985984 0 4478976 0 0

1.2 0 2488320 3836160 2985984 4976640 3359232 0 2239488 0 0

1.3 0 248832 684288 0 746496 373248 0 0 0 0

1.4 0 0 20736 0 0 0 0 0 0 6

1.5 0 0 0 0 0 0 0 0 0 696

1.6 0 0 0 0 0 0 0 0 0 14778

1.7 0 0 0 0 0 0 0 0 0 131736

1.8 0 248832 0 0 0 0 0 0 0 638880

1.9 0 2488320 0 746496 0 1492992 1119744 0 0 1903896

2 10077696 5474304 0 5971968 0 5225472 8957952 6718464 20155392 3719892

2.1 6718464 2488320 0 3732480 0 2985984 5598720 4478976 0 2405976

2.2 3359232 2488320 0 2985984 0 3359232 4478976 2239488 0 4929648

2.3 0 248832 0 0 0 373248 0 0 0 3719892

2.4 0 0 0 0 0 0 0 0 0 1903902

2.5 0 0 0 0 0 0 0 0 0 639576

2.6 0 0 0 0 0 0 0 0 0 146514

2.7 0 0 20736 0 0 0 0 0 0 146514

2.8 0 248832 684288 0 124416 0 0 0 0 639576

2.9 0 2488320 3836160 746496 2363904 1492992 1119744 0 0 1903902

3 10077696 5474304 7527168 5971968 7589376 5225472 8957952 6718464 20155392 3719892

3.1 6718464 2488320 3545856 3732480 4354560 2985984 5598720 4478976 0 2405976

3.2 3359232 2488320 3836160 2985984 4976640 3359232 4478976 2239488 0 4929648

3.3 0 248832 684288 0 746496 373248 0 0 0 3719892

3.4 0 0 20736 0 0 0 0 0 0 1903896

3.5 0 0 0 0 0 0 0 0 0 638880

3.6 0 0 0 0 0 0 0 0 0 131736

3.7 0 0 0 0 0 0 0 0 0 14778

3.8 0 0 0 0 0 0 0 0 0 696

3.9 0 0 0 0 0 0 0 0 0 6

4 0 0 0 0 0 0 0 0 0 0

Appendix XIII Experimental material

Experimental material consists of various documents used by air traffic controllers

participating in the study, as well as the subject matter expert (SME). The documents

used by controllers are presented in the following order:

a) The controller handbook;

b) Debriefing interview sheet; and

c) Feedback form.

The documents used by the subject matter expert are presented in the following order:

d) Subject matter expert’s assessment; and

e) Best practice procedure sheet.

a) The controller handbook

The Controller Handbook

Researcher: Branka Subotic

Supervisor: Dr Washington Y. Ochieng

University: Imperial College London

Location of experiment: XXX

June 2006

SUBJECT INSTRUCTIONS

Strategic and tactical decision making in ATC

Dear Controller,

Welcome to the “Strategic and tactical decision making in ATC” research program. Because of your extensive experience as an Air Traffic Controller, you have been asked to participate in this study. Our aim is to test a new approach to better understanding the decision making process of air traffic controllers. We will try to determine the cognitive processes that drive your decisions/actions during the dynamic and complex control of air traffic. The knowledge gained from this research will feed into the future design solutions of computerized ATC tools. We are not in a position to reveal more information on this study at this point, as it may influence your behaviour, actions, and the processes we wish to observe and analyze. At the end of this study you will be more familiar with our objectives and you will be able to ask as many questions as you find necessary. So please bear with us and help us make this study as realistic as possible.

Your understanding and help are crucial at every step of this study! This study is designed as an integrated part of regular emergency training in Dublin ATC Centre with minimal impact on the controller. Therefore, please consider and treat this training session as any other training session you have had in your professional career. From time to time, additional information may be given to you by the training instructor or researcher. On these occasions, please act as you would in the operational environment. Also, when information or instructions are given to you by the researcher, please regard them as if they came from a training instructor.

Now, we would like you to read the “Consent form”, which aims to inform you what the experiment involves and to make you fully aware of your rights while you are taking part in it. So please proceed to the next page, read the form, and sign it if you agree with all terms and conditions. If you have any questions, please do not hesitate to contact the researcher. In addition, we will ask you to fill out a questionnaire and participate in a de-briefing after the training session. The de-briefing part of this experiment is of high importance, as we will compare the recorded data with your own experience and decision-making process. Therefore, we would like to encourage you to give the researcher detailed input and explanation.

IMPERIAL COLLEGE LONDON RESEARCH SUBJECT INFORMED CONSENT FORM

The purpose of this research is to investigate the controller’s decision making process. You will be asked to complete one emergency training session and therefore perform air traffic control service through one traffic scenario. The entire experiment is expected to take approximately 1.5 h to complete.

The results of this experiment are for research purposes only, and may be presented at professional meetings or published in research literature. Your name will not be used in the reporting of results. Only recorded data will be used; all personal information will be kept completely confidential. A videotape of part of the experiment may be taken for purposes of data collection only. Neither your face nor identity will ever be associated with any reporting of these results.

In addition, because of the confidentiality of this experiment, you will be asked not to disclose any information about what you have experienced today to anyone (including family, fellow colleagues, and friends) for the next 30 days. Only in this way can we be assured that the experiment will remain as realistic as possible. With your signature below you are accepting these conditions. If for any reason you are unable to comply with any of the listed conditions, please inform the researcher right away and you will be released from any other obligations. Additionally, if you wish to withdraw from the experiment, you may do so at any time.

With Sincerest Thanks

I, ________________________________, understand that my participation in this experiment is completely voluntary and that I may refuse to participate, or withdraw from the experiment, at any time without penalty.

___________________________________ _________________
Participant Signature Date

I, _______________________________, the researcher, undertake to guarantee the confidentiality of the information you provided in this experiment. I understand that you reserve the right to seek legal redress should any aspect of this agreement be breached.

___________________________________ _________________
Researcher Signature Date

Prospective Research Subject: Read this consent form carefully and ask as many questions as you like before you decide whether you want to participate in this research study. You are free to ask questions at any time before or after your participation in this research.

Now you are ready for the training session!

~ When ready, contact the pseudo-pilot on the dedicated R/T frequency so that your training session can be initiated ~

POST-EXPERIMENT SESSION

Dear Controller,

Once again, thank you very much for your participation in this experimental trial. Now you understand what our true objective in the experiment was and why we had to keep it confidential. Our objective in this research project is to study controller recovery from equipment failures in ATC. However, in order to achieve the unexpected effect of this rare occurrence, it was necessary to mask the real objective of this research. Our aim is therefore to determine how controllers manage equipment failures. The complexity of this experiment gave us the opportunity to test only one equipment failure in spite of the large number of potential equipment failures in any ATC Centre. By observing your reactions, recovery strategy, and attitude, we are aiming to identify better solutions in the design of ATC tools/systems, recovery procedures, and training. Our belief is that current, more automated ATC Centres need to create better support for their main element: air traffic controllers.

For the above reasons, we kindly remind you that you have agreed not to disclose any information and details from today’s experiment to your fellow colleagues, family, and friends in the next 30 days.

Once again, we would like to highlight that without your help and understanding this research would not be possible!

Post experiment questionnaire

If you need clarification at any point, please do not hesitate to contact the researcher!

How suitable was your previous training to the situation (equipment failure) that you have just experienced? Please answer this question taking into account the quality of the training syllabus as well as the frequency of training. (Circle the appropriate number)

1. Suitable to the situation in question
2. Tolerable to the situation in question
3. Counter productive to the situation in question

When was your last emergency training?

1. In the last 30 days
2. In the last 6 months
3. 1 year ago
4. More than 1 year ago

Did you have training on equipment failures during that session? Y N

Do you need better or more frequent training for unusual situations, such as handling emergencies? Y N

Please mark the statement that is closest to your previous experience with equipment failures:

1. I have experienced a very similar or the same type of equipment failure in the past.
2. I have not experienced this particular type of failure, but have experienced other types of equipment failures previously.
3. I have never experienced equipment failure in my professional career.

Please mark the statement that is closest to your experience with the ATC system:

1. I trust ATC technology more than I trust my own judgments.
2. I trust new ATC technology but I am aware of possible failures.
3. I do not trust new ATC technology, even though it is designed to make my job easier.

Current rating: ACC (RDR, Proc)   APP (RDR, Proc)   TWR

Age: ____

Years of experience as a controller: ____

How would you rate your personal ability in today’s training session? Personal ability comprises different factors, including but not limited to: your level of fatigue, stress, confidence, complacency, your ability to cope with an emergency situation, any family or other social group issues, etc. Based on this explanation, rate your personal ability:

1. Suitable for the recovery process
2. Tolerable for the recovery process
3. Counter productive for the recovery process

How would you rate your communication for recovery today:

1. Efficient
2. Tolerable
3. Inefficient

Would you say that you had enough time to recover from the effect(s) of the equipment failure (taking into account possible development of less than adequate separation)?

1. Yes, time was adequate. Time necessary to recover was less than available time in the simulation.
2. No, time was not adequate. Time necessary to recover was in excess of available time in the simulation.

Is there a relevant recovery procedure for this particular failure? Y N

If yes, in your opinion is that procedure:

3. Counter productive to the situation in question

How familiar are you right now with that procedure?

1. Very familiar
2. Semi familiar
3. Not familiar at all

Would you say that HMI and operational support have been:

3. Counter productive to the situation in question

Would you say that:

1. External working environment matched your internal mental model during the recovery process
2. External working environment mismatched your internal mental model at any point of recovery

How would you rate the adequacy of organisation in your ATC Centre?

1. Efficient
2. Tolerable
3. Inefficient

How would you rate traffic complexity during the recovery process (please note: only during the recovery process and not during the entire training session):

1. High
2. Average
3. Low

How would you rate the complexity of the airspace in the used scenario? The airspace complexity was:

1. Adequate
2. Tolerable
3. Inappropriate

How would you rate weather conditions during the recovery process?

1. Improved
2. Unchanged
3. Deteriorated

Adequacy of organisation:

• The quality of roles and responsibilities
• The availability and adequacy of supervision
• Attitude to teamwork
• Support for organised exchange of past experience on eq. failures
• Personnel selection process
• Shift patterns and personnel planning
• Availability of team members
• Availability of additional support (e.g. Assistant)
• Safety culture
• Communication with management and technicians (e.g. Briefings, exchange of knowledge, bulletins)
• Existence of stress management programs

Traffic complexity:

• The mix of IFR/VFR
• Military aircraft
• The existence of priority aircraft
• Speed mix of aircraft
• Amount of vertical movements
• Amount of crossing movements
• Amount of conflicts

Airspace complexity:

• The number of crossing points
• Proximity of crossing points to the sector boundaries
• Number of flight levels
• Number of entry points
• Number of exit points
• Special use airspace (SUAs)
• Upper vs. Lower airspace
• Airways configuration
• The number of neighbouring sectors
• Sector geometry (e.g. sharp edges)
• Size of sector
• Bidirectional vs. unidirectional routes
• Route length
• Proximity of route to sector boundary

Considering the entire training session, how would you rate the overall task complexity:

1. Conflicting, multiple tasks existed during this training session.
2. Average complexity of the situation.
3. Extremely low complexity of the situation.

How would you rate your recovery performance today?

1. Efficient
2. Tolerable
3. Inefficient

How different is your performance today from any other day?

1. Not different at all
2. Similar
3. Very different

How representative has today’s performance been of your overall ability to recover from an equipment failure in ATC?

1. Highly representative
2. Average
3. Not representative at all

How realistic was today’s task?

1. Highly realistic
2. Moderately
3. Not realistic at all

Are you completely aware of the impact/implications of the particular failure that you have just experienced? Do you fully understand what will happen when particular equipment fails? Y N

Any comment?

Would you like to see some form of Aide-Memoire (flip chart, small laminated booklet, HMI drop down menu) available at each CWP to assist you in recognising the effects of a particular equipment failure and the steps to be taken toward its recovery? Y N

Is there any aspect of training, procedures, HMI, or teamwork that could have enhanced your recovery performance today?

Thank you!!!!

b) Debriefing interview structure

IMPERIAL COLLEGE LONDON

DEBRIEFING INTERVIEW STRUCTURE

Questions for each subject:

1. How did you notice/detect that there was an equipment failure? What info triggered the detection?

2. When exactly did detection occur?

3. What could have been the worst consequence if the situation was not detected?

4. Did you find the diagnosis phase possible/necessary? If yes go to question 5. If no go to question 8.

5. What was your diagnosis?

6. What did you do with it (i.e. tried to confirm, or rule out alternatives)?

7. Was the recovery strategy influenced by diagnosis?

8. How did you choose the recovery strategy to apply (i.e. based on training, own experience, colleague’s experience, any other source of info)?

9. What could have made the situation worse?

10. Can you think of any fall-back actions which could mitigate this situation? Can you suggest any changes to the procedures, phraseology, HMI design, or fall-back procedures that could improve the situation?

Note: The researcher should replay the video recording from the moment of failure

injection and start further discussion with the subject.

c) Feedback form

FEEDBACK FORM

Concerning the study conducted by representatives of

Imperial College London at XXX ATC Centre 06/06/06 – 09/06/06

Dear Controller,

Now that you have participated in this study, we would like to ask you to provide your feedback on its importance and value. Please answer all questions as accurately as possible, since these answers will guide us in our future endeavours. Your answers will be used only for the assessment of the usefulness of this study. Once again, thank you very much for participating in this study!

Please circle the appropriate answer:

Did you find participating in this study interesting? Y N

Do you think that this experience is beneficial for your future work? Y N

Do you feel that this experiment raised important issues? Y N

Do you feel that this experiment helped you to identify any gaps in your:

• Knowledge Y N

• Training Y N

• Skills Y N

• Awareness of effects of unusual events Y N

Would you be willing to participate in future studies of this type? Y N

Do you have any other comments on the experiment?

After completing, please return this feedback form to the office of XXX.

Thank you for your time! Your cooperation is highly appreciated.

Researcher Assistant

XXX, June 2006

d) Subject matter expert’s assessment

ASSESSMENT OF THE DEPENDENCY VARIABLE IN THE EXPERIMENT

Our objective in this research project is to analyse the recovery from equipment failures in ATC. Since the area of ATC is highly specialised, it was necessary to evaluate the controller’s recovery performance using expert opinion. As a Subject Matter Expert (SME) in the area of Air Traffic Control (ATC), you are asked to help in the assessment of the subject controller’s recovery performance. We kindly ask you not to disclose any information and details on this experiment to your fellow colleagues in the next 30 days, so that we can ensure that the injection of the failure is an unexpected event for each subject-controller.


Based on the controller performance that you observed in this experiment (either “live” or on the video recording of the experimental trial), please use your professional experience to assess the effectiveness of the controller’s recovery.

Recovery is considered successful if the system returns to the normal or intermediate (but still stable) state. In the short term (as simulated in this experiment), the situation should be stable and control of airspace should be considered safe, but not necessarily efficient.

Please note that the anchor points of each scale range from “Firmly Disagree” to “Firmly Agree.” Place a mark in one of the five boxes along each line, as shown in the following example.

Example

In general, I am professionally more efficient in the mornings than evenings.

x

Firmly Disagree | Partly Disagree | Neutral | Partly Agree | Firmly Agree

1. The recovery strategy implemented by this controller can be considered successful.

Firmly Disagree | Partly Disagree | Neutral | Partly Agree | Firmly Agree

2. In this traffic scenario, it was possible to implement more than one recovery strategy.

Firmly Disagree | Partly Disagree | Neutral | Partly Agree | Firmly Agree


If you answered ‘partly agree’ or ‘firmly agree’, your answer implies that you thought of one or more alternative recovery strategies. Please describe the alternative(s) briefly.

3. If you were in the place of the subject controller, would you implement a different recovery strategy than he did?

Firmly Disagree | Partly Disagree | Neutral | Partly Agree | Firmly Agree

If answered ‘partly agree’ or ‘partly disagree’, please specify your reasons for implementing a different recovery strategy and which recovery strategy that would be. In addition, please specify any particular/difficult issues regarding the traffic situation during the recovery process:

Evaluation of the contextual factors in the training scenario

Please circle the corresponding answers according to your professional experience and expertise:

How would you rate the complexity of the simulated failure type?

1. Single system affected
2. Multiple systems affected

How would you rate the time course of the simulated failure development?

1. It was a sudden failure.
2. It was a latent failure.
3. It was a gradual degradation of the system.

Would you say that the controller had enough time to recover from the effect(s) of the equipment failure?

1. Yes, time was adequate. Time necessary to recover was less than available time for recovery in the simulation.

2. No, time was not adequate. Time necessary to recover was in excess of available time for recovery in the simulation.

Is there a recovery procedure for this particular failure? Y N

If yes, is that procedure:

1. Suitable to the observed situation in question

2. Tolerable to the observed situation in question

3. Counter productive to the observed situation in question


How would you rate the duration of the simulated equipment failure?

1. Short period of time (it is reasonable to consider this less than 15 min)
2. Moderate period of time (it is reasonable to consider this less than 1 h)
3. Substantial period of time (it is reasonable to consider this more than 1 h)

How would you rate traffic complexity during the recovery process? (Please note: only during the recovery process and not during the entire training session.)

1. High
2. Average
3. Low

How would you rate airspace complexity in the scenario used?

1. Adequate
2. Tolerable
3. Inappropriate

How would you rate weather conditions during the recovery process?

1. Improved
2. Unchanged
3. Deteriorated

How realistic was today’s task?

1. Highly realistic

2. Moderately

3. Not realistic at all

Thank you!!!!

The mix of IFR/VFR

Military aircraft

The existence of priority aircraft

Speed mix of aircraft

Amount of vertical movements

Amount of crossing movements

Amount of conflicts

The number of crossing points

Proximity of crossing points to the sector boundaries

Number of flight levels

Number of entry points

Number of exit points

Special use airspace (SUAs)

Upper vs. Lower airspace

Airways configuration

The number of neighbouring sectors

Sector geometry (e.g. sharp edges)

Size of sector

Bidirectional vs unidirectional routes

Route length

Proximity of route to sector boundary


e) Best practice procedure sheet

BEST PRACTICE PROCEDURE FOR XXX SIMULATION

Detect the problem

• Either by pilot’s first contact or

• Visually on the radar display (uncorrelated track). In this case first assumption may be transponder failure. After confirmation that a/c transponder is serviceable, further check on system performance should be conducted.

Identify failure type either by ATCO or by input from the coordinator

Locate traffic

Check identity of all tracks (referring to the eastbound overflight)

Identify traffic using appropriate technique

• Bearing/range

• Turn method

Inform all traffic on RTF of the failure and advise of possible restrictions

Maintain identification of all traffic

Ground trainer

Refuse departures permission to depart

Get all airborne traffic to land

Maintain accurate and timely strip marking throughout the process

Provide vertical separation

Utilize holding patterns when necessary

After restoration has been confirmed by coordinator:

• Re-identify all traffic

• Confirm Mode C

• Continue to monitor

• Release all departures

First possible detection/action may have occurred at: ______________

First actual action occurred at: ______________

End of the recovery process (release of the departures): ______________

Chapter 13 Appendices

Appendix XIV Overview of RIFs, their corresponding levels, and probabilities determined in the experimental investigation

| Group | (1) ID | (2) RIF name | (3) Descriptor | (4) Probability (p) | (5) Expected effect of controller recovery performance | (6) Level | (7) Designator (R) | (8) Probability of overall situation occurring (p*R) |
|---|---|---|---|---|---|---|---|---|
| Internal factors | 1 | Training for recovery from ATC equipment failure | | | | 1 | 1 | 0.73 |
| | | | | | | 2 | 0 | 0 |
| | | | | | | 3 | -1 | -0.03 |
| | 2 | Previous experience with equipment failures | Experienced with a particular type of failure or Experienced with any other type of ATC equipment failure | | | 1 | 1 | 0.83 |
| | | | No experience with ATC equipment failures | | | 2 | 0 | 0 |
| | 3 | Experience with the system performance (reliance or trust) | Objective attitude toward the system | | | 2 | 0 | 0 |
| | | | Positive experience with the system (excessive trust) or Negative experience with the system (under-trust) | | | 3 | -1 | -0.07 |
| | 4 | Personal factors | Suitable for the recovery process | | | 1 | 1 | 0.83 |
| | | | Tolerable for the recovery process | | | 2 | 0 | 0 |
| | | | Counter productive for the recovery process | | | 3 | -1 | -0.03 |
| | 5 | Communication for recovery within team/ATC Centre | | | | 1 | 1 | 0.27 |
| | | | | | | 2 | 0 | 0 |
| | | | | | | 3 | -1 | -0.07 |
| Equipment failure related factors | | | Single system affected | 0 | Non significant | 2 | 0 | 0 |
| | | | | 1 | Least favourable | 3 | -1 | -1 |
| | 7 | Time course of failure development | Sudden failure | 1 | Improve | 1 | 1 | 1 |
| | | | Persistent or latent failure | 0 | Non significant | 2 | 0 | 0 |
| | | | Gradual degradation of system | 0 | Least favourable | 3 | -1 | 0 |
| | 8 | Number of workstations/sectors affected | One workstation/one sector or All workstations in one sector | 0 | Non significant | 2 | 0 | 0 |
| | | | Several workstations/couple of sectors or All workstations/all sectors | 1 | Least favourable | 3 | -1 | -1 |
| | | | Adequate - less than available time | | | 1 | 1 | 0.86 |
| | | | Inadequate - in excess of available time | | | 3 | -1 | -0.14 |
| | | | | 0 | Most favourable | 1 | 1 | 0 |
| | | | | 0 | Non significant | 2 | 0 | 0 |
| | | | Inappropriate | 1 | Least favourable | 3 | -1 | -1 |
| | | | Short period of time | 1 | Non significant | 2 | 0 | 0 |
| | | | Moderate period of time or Substantial period of time | 0 | Least favourable | 3 | -1 | 0 |
| External or factors related to working conditions | 12 | Adequacy of HMI and operational support | | 0.5 | Most favourable | 1 | 1 | 0.5 |
| | | | | | | 2 | 0 | 0 |
| | | | | | | 3 | -1 | -0.11 |
| | 13 | | | 1 | Most favourable | 1 | 1 | 1 |
| | | | | 0 | Least favourable | 3 | -1 | 0 |
| | 16 | Adequacy of organisation | | | | 1 | 1 | 0.4 |
| | | | | | | 2 | 0 | 0 |
| | | | | | | 3 | -1 | -0.1 |
| Airspace related factors | 17 | Traffic complexity | Average traffic complexity | | | 2 | 0 | 0 |
| | | | Extremely high or extremely low traffic complexity | | | 3 | -1 | -0.65 |
| | 18 | Airspace characteristics | Adequate (e.g. enroute higher levels) | 0.8 | Most favourable | 1 | 1 | 0.8 |
| | | | | | | 2 | 0 | 0 |
| | | | Inappropriate (e.g. enroute lower levels or terminal) | | | 3 | -1 | -0.1 |
| | 19 | Weather conditions during the recovery process | Improved | 0.83 | Non significant | 2 | 0 | 0 |
| | | | Deteriorated | 0.17 | Least favourable | 3 | -1 | -0.17 |
| | 20 | Conflicting issues in the situation (task complexity) | Average complexity of the situation | 0.3 | Non significant | 2 | 0 | 0 |
| | | | Conflicting, multiple tasks or Extremely low complexity of the situation (may lead to monotony) | 0.7 | Least favourable | 3 | -1 | -0.7 |
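The last column of the table, (p*R), is the product of the probability of a level (p) and its designator (R). As an illustration only, the following Python sketch recomputes this product for the two levels listed for RIF 19 (weather conditions during the recovery process). The function name `weighted_contribution` and the tuple layout are assumptions introduced here for the example and are not part of the experimental analysis.

```python
def weighted_contribution(p, r):
    """Return p*R: the probability of a RIF level multiplied by its designator."""
    return p * r


# (descriptor, probability p, designator R) as listed for RIF 19 in the table.
rif_19_levels = [
    ("Improved", 0.83, 0),
    ("Deteriorated", 0.17, -1),
]

for descriptor, p, r in rif_19_levels:
    # A designator of 0 removes the level's contribution; -1 makes it negative.
    print(descriptor, weighted_contribution(p, r))
```

Under this reading, a level with designator 0 contributes nothing regardless of its probability, which is why every Level 2 row in the table shows 0 in the last column.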



The distribution of the recovery context indicator (Ic) obtained from the experimental

results is presented in Figure 1.

[Histogram. Vertical axis: Frequency, from 0 to 800. Horizontal axis: bins from -0.088 to 0.112 in steps of 0.01.]

Figure 1 Distribution of the recovery context indicator in the experimental investigation

(six RIFs defined through one level)

Based on the shape of the Ic distribution, the data has been fitted with two normal

distributions according to equation 1 (Figure 2). The distribution on the left accounts for

unfavourable recovery contexts whose recovery context indicator takes the average

value of -0.04 (A1=141.4, SD1=0.02). The distribution on the right accounts for

favourable recovery contexts whose recovery context indicator takes an average value

of 0.04 (A2=632.8, SD2=0.04).

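$$f(x) = A_1 \times e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} + A_2 \times e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}} = 141.4 \times e^{-\frac{(x+0.04)^2}{2 \times 0.02^2}} + 632.8 \times e^{-\frac{(x-0.04)^2}{2 \times 0.04^2}} \qquad (1)$$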


Figure 2 Fitting of the two normal distributions
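As a minimal sketch, the fitted curve can be evaluated numerically as the sum of two Gaussian terms using the parameters quoted in the text (A1=141.4, mean -0.04, SD1=0.02; A2=632.8, mean 0.04, SD2=0.04). The function and constant names below are assumptions chosen for illustration; the fitting procedure itself is not reproduced here.

```python
import math

# Parameters quoted in the text: (amplitude A, mean, standard deviation).
UNFAVOURABLE = (141.4, -0.04, 0.02)  # left-hand distribution
FAVOURABLE = (632.8, 0.04, 0.04)     # right-hand distribution


def gaussian_term(x, amplitude, mean, sd):
    """One term of the fit: A * exp(-(x - mean)^2 / (2 * sd^2))."""
    return amplitude * math.exp(-((x - mean) ** 2) / (2 * sd ** 2))


def fitted_frequency(x):
    """Sum of the unfavourable and favourable terms at indicator value x."""
    return gaussian_term(x, *UNFAVOURABLE) + gaussian_term(x, *FAVOURABLE)


# Evaluate the fitted curve at the two quoted means.
for ic in (-0.04, 0.04):
    print(ic, round(fitted_frequency(ic), 1))
```

At the favourable mean the curve is dominated by the second term, so its value is close to A2, while at the unfavourable mean the tail of the favourable distribution still adds noticeably to A1.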
