Flight Examiner Methods

Flight Examiners’ Methods of Ascertaining Pilot Proficiency

Wolff-Michael Roth1,2

1University of Victoria; 2Griffith University

Abstract There are no studies about how flight examiners (check captains) think while

assessing line pilots during flight training and examination. In this study, 23 flight examiners

from 5 regional airlines were observed or interviewed in three contexts: (a) surrounding the

assessment of line pilots in simulator sessions; (b) stimulated recall concerning the assessment of

pilots; and (c) modified think-aloud protocols during assessment of flight episodes. The data

reveal that flight examiners use the documentary method, a mundane form of reasoning, where

observations are treated as evidence of underlying phenomena while presupposing these

phenomena to categorize the observations. Although this method of making sense of pilot

performance is marked by uncertainty and vagueness, different mathematical approaches are

proposed that have the potential to model this form of reasoning or its results. Possible

implications for theory and the practice and training of flight examiners are provided.

Keywords Debriefing · assessment · cognition · cognitive anthropology · thought process

A recently accredited flight examiner during stimulated recall: Probably your biggest fear

is having to fail someone. That’s why people say, “Your worst one always is your first

one.” . . . Becoming an examiner is like going for your first solo or your first commercial

nav. “Well done, you’ve got the job.” (D2)

A first officer during a post-debriefing interview: You’re always trying to remember

[the flight examiner’s] way of doing it. You’ve just got to remember to tick their boxes

and you’re okay. You’re also trying to remember the way they’re thinking. And if you get

back in to that then you don’t get so many comments.

A flight examiner is an authorized check pilot, who has taken on some of the duties of a

flight inspector on behalf of the regulatory authority (e.g., CAA-NZ, 2013; Transport Canada,


2013). Flight examiners are airline captains who serve their airline while remaining responsible to the

regulatory authority, and who assess the competencies of pilots against a national standard for

continued accreditation purposes and during type-rating. In the first introductory quotation, a

recently accredited flight examiner talks about becoming a flight examiner with rather little

experience or training in how to do the job, or in how to assess in such a way that the fears associated with

having to fail a pilot are lessened by the ability to ground the assessment in evidence. Such

comments concern the conceptual shifts individuals have had to make when becoming flight

examiners. Perhaps unsurprisingly, conversations with airline training and standards managers

reveal their eagerness to find out more about how flight examiners think for the purpose of using

such information in their training of new flight examiners, an increasing number of whom are

needed because of the rapid expansion some airlines experience.

In the second introductory quotation, an experienced first officer (52 years of age, 30 years as

commercial pilot, 9,000 flight hours) talks about what really matters when he undergoes

examination following an operational competency assessment. First, he suggests that a pilot has

to remember how the particular examiner flies, which is the implicit referent for his assessment;

and a pilot also has to remember the way in which flight examiners are thinking, that is, the

processes of their thinking that lead them to make assessments and influence recommendations

for improvement or training. But how do flight examiners think? What evidence do they seek and

use to arrive at an assessment and how do they think (what are their thinking processes)?

Currently there are few studies; and those that can be found have been conducted in constrained

settings using pre-recorded video (e.g., Roth, Mavin, & Munro, 2014a) rather than observing

flight examiners at work.

Early work on pilot assessment investigated assessment in terms of measurement models and

focused on outcomes (e.g., Flin et al., 2003; Holt, Hansberger, & Boehm-Davis, 2002;

O’Connor, Hörmann, Flin, Lodge, & Goeters, 2002), where the measurements sometimes are

supported and enhanced by automated tools (e.g., Deaton et al., 2007; Johnston, Rushby, &

Maclean, 2000). More recent studies focus on the nature of the evidence that flight examiners

use in support of their ratings (e.g., Roth & Mavin, 2014). None of these studies investigate how

flight examiners think, the method or methods that they use to arrive at statements about the


proficiency, knowledge, skills, or states of pilots. This study, grounded in the cognitive

anthropology of work, was designed to investigate flight examiners’ methods. The ultimate goal

of this work is to construct a basis for the training and professional development of flight

examiners.

There exists considerable research on how experts from a variety of fields think, including

medical experts engaged in clinical reasoning (Boshuizen & Schmidt, 1992), instructional designers

(Perez, Johnson, & Emery, 1995), historians (Wineburg, 1998), and scientists (Alberdi, Sleeman,

& Korpi, 2000). Such studies show the tremendous importance of subject-matter specific

knowledge in the forms of reasoning observed, knowledge that sometimes is not apparent

because it is encapsulated in the practical knowledge that experts have developed over the years

(Boshuizen & Schmidt, 1992). A recent study from aviation—investigating the ways in which

pilots, including flight examiners, assessed other pilots—showed that assessors vary in the facts

on which they base their assessment (Roth et al., 2014a). Moreover, first officers and captains

go about assessment in ways that differ from those of flight examiners: the former move item by

item through an assessment form and identify performances that allow them to give a score on

that item whereas the latter first construct narrative descriptions and then map the result onto the

assessment form (Roth & Mavin, 2014). Both of these studies investigate assessment of short

scenarios. Such studies tell us little about how experts in assessment—flight examiners—actually

think in the context of real situations where they observe pilots over four-hour periods. In the

present study, therefore, we combine information gleaned in the field (by means of interviews

with flight examiners at three points during their work of examining pilots and videotapes of the

debriefings) with information gleaned from experimental settings (think-aloud protocols typical

for research on expertise).

Research Methods

This study was designed to investigate the methods flight examiners use to assess airline

pilots. We employ methods typical for cognitive anthropology. In this field, approaches from

different research traditions are combined, including field observations in real settings typical of

anthropology and think-aloud protocols and stimulated recall in constrained settings typical of

empirical cognitive science.


Design

In this study, flight examiners were recorded in three contexts: (a) in and related to actual

debriefing at work; (b) stimulated recall sessions; and (c) modified think-aloud protocols

requiring pairs of flight examiners to assess crewmembers featured in videotaped scenarios. The

design included flight examiners participating (a) in the debriefings only, (b) in debriefings and

stimulated recall, (c) in debriefings, stimulated recall, and think-aloud protocols, and (d) in

think-aloud protocols only (Table 1). In the debriefing context, individual flight examiners were

recorded from 1 to 5 times (Table 1). The nature of this design provides for both breadth and

depth to the investigation of how flight examiners think while controlling for any particulars of

thinking as a function of the task.

««««« Insert Table 1 about here »»»»»

Participants

A total of 23 flight examiners (mean age = 47.9 years, SD = 8.4) from 5 airlines participated in this

study; all were male. Eight flight examiners took part in debriefings only, with different numbers

of sessions—e.g., three were recorded during one session, 4 during two sessions, etc. (Table 1).

Two examiners were recorded surrounding the debriefing sessions and in stimulated recall

concerning one of the sessions. Four flight examiners participated in all three tasks—debriefing,

stimulated recall, and think-aloud protocols; and 8 individuals participated in the think-aloud

protocols only (Table 1). All flight examiners were experienced pilots, with a mean of 25.2 years

(SD = 8.7) as commercial pilots and mean accumulated flying time of 13,200 hours (SD =

4,140). They had served as flight examiners from 1 month to 23 years, with a mean of 8.5 years

(SD = 6.6).

At the time of the study, the flight examiners worked for regional airlines that had been

selected based on a 2 × 2 factorial design: (a) use (airlines B, D, E) or not (airlines A, C) of an

explicit, human factors based model of assessment of pilot performance (MAPP) (Mavin, Roth,

& Dekker, 2013) and (b) use (airlines C, D) or not (airlines A, B, E) of a debriefing tool, which

allows flight examiners to replay part of a simulator session featuring a video, some of the

instruments (e.g., electronic attitude director indicator [EADI], electronic flight instrument

system [EFIS]), and some actuators (e.g., control column, flap levers, power levers).
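
The airline selection described above can be laid out as a small cross-tabulation. The following is a minimal sketch in Python; the airline labels and the two factor assignments come from the text, whereas the variable and function names are hypothetical and introduced only for illustration.

```python
from itertools import product

# Factor assignments as stated in the text; the names below are
# illustrative assumptions, not part of the study's materials.
AIRLINES = ["A", "B", "C", "D", "E"]
USES_MAPP = {"B", "D", "E"}          # explicit human-factors-based assessment model
USES_DEBRIEFING_TOOL = {"C", "D"}    # simulator replay tool


def design_cells(airlines, mapp_users, tool_users):
    """Group airlines into the four cells of the 2 x 2 design.

    Keys are (uses_mapp, uses_debriefing_tool) pairs of booleans.
    """
    cells = {key: [] for key in product((False, True), repeat=2)}
    for airline in airlines:
        cells[(airline in mapp_users, airline in tool_users)].append(airline)
    return cells


cells = design_cells(AIRLINES, USES_MAPP, USES_DEBRIEFING_TOOL)
for (mapp, tool), members in sorted(cells.items()):
    print(f"MAPP={mapp!s:5}  tool={tool!s:5}  airlines={members}")
```

Under these assignments every cell of the design is occupied, with airlines B and E sharing the cell that uses the MAPP but no debriefing tool.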


All participants in the debriefing and think-aloud parts of the study were randomly selected

from among those who were willing to participate and whom the company roster had available during

the field work periods. The flight examiners in the stimulated recall sessions had time because of

scheduling or were freed by their airline to be able to participate.

Ethics

This study was designed in collaboration with the participating airlines. In addition to being

approved by the university ethics board, approval was received, where applicable, from the

respective labor unions. All potential participants were guaranteed that their non/involvement in

the study would not affect their employment status; and participants were free to leave the study

at any time or to withdraw their data from use. No participant withdrew.

Tasks and Task Settings

Debriefing at work. Flight examiners were recorded during debriefing sessions that

followed actual examinations or training in the flight simulator. The flight examiners were

interviewed following the first half of the 4-hour simulator session and at the end of the second

half, immediately prior to debriefing. The flight examiners were interviewed again immediately

following debriefing. Participants were asked about what had been salient to them during the

simulator session, what they intended to bring up during debriefing, and how they felt that the

examinees/trainees were doing. Following debriefing, participants talked about their thinking

from the beginning of the simulator session to the end of debriefing, how they arrived at their

assessments, and how they selected what they intended to debrief. The interviews were semi-

structured, containing both specific questions directed to every participant (e.g., “What stood out

for you in the session?”, “What and how could debriefing be improved?” or “At what point did

you decide on your assessment and how?”) and providing opportunities to participants to

articulate any issues that they deemed relevant.

Stimulated recall. In stimulated recall sessions, participants are asked to talk about their

reasoning during the assessment after the fact, generally with the aid of videotapes of their performances;

the method serves to investigate phenomena of interest such as rater cognition (Suto, 2012). Participants talked about their thought processes

during the period from the beginning of the simulator session to the end of the debriefing while snippets

were replayed to them.


Modified think-aloud protocols. Think-aloud protocols constitute a standard method in

cognitive science for investigating the nature of expertise (Ericsson & Simon, 1993); the method

was employed in a modified form that has been shown to be successful in recent aviation-related work

(Roth & Mavin, 2014). Three pairs of flight examiners each from airlines A (no-MAPP, no

debriefing tool) and D (MAPP, debriefing tool) were asked to assess both crewmembers

(captain, first officer) who appear in three videotaped scenarios.

Data Collection

Debriefing-related. The debriefings lasted between 11.1 and 57.2 minutes (M = 36.4, SD =

14.3) for a total of 17.6 hours; approximately 8 hours of interviews associated with the

debriefings were recorded. Debriefings were recorded in the respective companies’ regular

facilities using two cameras that captured all parts of the rooms; a third, laptop-based camera was

used as a backup.

Stimulated recall. A total of 5.1 hours of stimulated recall were recorded. Participants were

shown excerpts from the debriefings they had conducted. One camera was used, recording the

debriefing video and the notes that the flight examiner had taken during the session.

Modified think-aloud protocol. The think-aloud sessions were recorded using three

cameras, one showing the workspace (notes), one featuring the pilots head on, and a third

recording what the participants were currently watching on their TV monitor. A total of 13.8

hours of think-aloud protocols were collected.

Across-task materials. In addition to the recordings, we draw on informal interviews and

observations from our ongoing ethnographic studies. The database includes all aircraft-specific

systems manuals, manufacturer- and airline-specific standard operating procedures and

procedures for abnormal situations for the different aircraft involved. It also includes the

authors’ field notes and all analysis-related and information-seeking exchanges with training

managers.

Analyses

Settings and processes. All recordings were transcribed verbatim in their entirety. The job

was contracted to a commercial provider with access to an individual who has aviation

experience. All transcriptions were verified in their entirety by the authors. For the purpose of


analysis, the different videos of the same situations were combined into a single display. To

analyze the data, we drew on interaction analysis (Jordan & Henderson, 1995), “an

interdisciplinary method for the empirical investigation of the interaction of human beings with

each other and with objects in their environment” (p. 39). It involves groups of researchers, both

those who “own” the project and colleagues interested in data sessions, who jointly analyze the

data with a commitment to ground their assertions and theories in the empirical evidence

available. Every assertion, every claim, has to be supported by evidence from the tapes

(transcriptions). Analyses began during jointly conducted fieldwork, often with video replayed in the

evenings following the recordings. Weeklong analysis sessions, sometimes involving

colleagues from other disciplines, were held for developing the contents of the findings.

Analytic expertise. The analyses of cognitive tasks require related competencies. The

second author had 22 years of experience as a commercial pilot before becoming a university

professor; he continues working as a flight examiner for a major aircraft manufacturer, and

provides workshops for flight examiners for different airlines. The third author has a total of 28

years of military and civil flight experience (8,500 flight hours), has been a flight examiner for 9

years, and currently serves as training manager. The first author is an applied cognitive scientist

with extensive experience in the study of cognition at work. For the past three years, he has

engaged in cognitive anthropological study of assessment and debriefing in aviation. As part of

this work, he flew small aircraft, had simulator sessions flying larger aircraft, observed

simulator-based examinations, and accompanied pilots in the cockpit during regular line

operations.

Findings

“So we’re using specific examples to cover a generic fix.” (B3)

This study was designed to investigate how flight examiners think while assessing pilots

during regulator-specified mandatory examinations. The introductory quotation from one of the

most experienced participants in this study (18 years as flight examiner plus 10 years as standards

and training manager) captures the essence of the flight examiners’ method: During the

examination in the simulator, the flight examiner develops “a generic fix,” a sense of the pilot’s

current abilities and then “uses specific examples” from the flight session as evidence, that is, as


concrete manifestation of the presupposed underlying phenomenon denoted by the generic fix.

This is the essence of what has been called the documentary method of interpretation

(Mannheim, 2004). In this method, observations are taken as evidence for, or documents of, an

underlying reality while using this reality as a resource for explaining or interpreting the

observation (Suchman, 2007). It corresponds to the mundane idealizing of reality (Pollner,

1987). All flight examiners, without exception, use the documentary method of interpretation

(Table 2). In fact, it has been suggested that this constitutes an everyday, mundane method of

making sense of the world (Garfinkel, 1967). The documentary method was employed even

when the flight examiners worked with an explicit model of assessment of pilot performance

with an associated assessment metric that mapped performance descriptions to a score (e.g.,

“Unable to recall facts or made fundamental errors in their recall” = 1 [unsatisfactory] for

knowledge/facts or “Adequate organization of crew tasks” = 3 [satisfactory] for

management/workload). That is, rather than engaging in the measurement of assessment, flight

examiners employ ways of categorizing and explaining observations that underlie mundane and

formal scientific reasoning methods (e.g., Bohnsack, Pfaff, & Weller, 2010).


Flight Examiners Use a Documentary Method

The documentary method of interpretation as described in the literature is based on three

levels of sense: objective sense, expressive sense, and documentary sense. However, the

expressive sense pertains to a social actor’s intentions that cannot be objectively obtained. In the

present study, inferences that the flight examiners make about pilots’ intentions, therefore, are

treated as special cases of the documentary sense.

Objective sense. The objective sense of a situation refers to what different observers can

actually see and agree upon: their facts or their evidence. Flight examiners identify

indisputable facts, for example, that a pilot has (not) pushed the go-around button, what the

precise speed is (e.g., 145 knots, white bug + 10), or what the torque gauge reads. In the

following example from a debriefing, the flight examiner lists a set of observations that

constitute his objective sense, the facts used in the assessment of the pilots.

You were at 34 or something like that. Not a long way out. Told ATC. Made a PA. And then


came back . . . to around the 240 to 250 indicated mark until we got below 10. And then we were

sitting at 235 knots. (B3)1

The list includes where the pilots had made the decision to turn around (35 DME), that there

was an exchange with air traffic control followed by a public announcement. They were flying

at a speed between 240 and 250 knots until they were less than 10 nautical miles on the

distance measuring equipment (DME), at which point they were flying at 235 knots.

Such lists do not in themselves constitute an evaluation but are (a) used as manifestations of

underlying intentions and (b) treated as the manifestations (documentary evidence) of one or

more underlying, not directly observable forms of knowledge (aircraft or standard operating procedures),

skill (manipulative, communicative, management, or decision-making skill), or state (e.g.,

situational awareness). Although more imprecise and fuzzy, non-technical areas are described in

terms of concrete, observable evidence. For example, a flight examiner described the slow

responses of a first officer observable in the cockpit, during debriefings, and during regular

conversations, which he suggested could be verified by the interviewer (see Table 3).

[The first officer’s] execution of procedures is, it’s slow. «First officer’s» response to everything,

and you’ll find this when you talk with him, is a very slow response . . . there’s considerable

delay and then you get a response. And you generally get the right response. (B1)


Documentary sense. Over the course of a session, the flight examiners build up a

holistic sense of pilots (“a generic fix”), an evaluation of their skills that is grounded in the

documentary evidence and that in turn explains the actual performance. For example, in

an icing condition during a single-engine approach, the pilots had not entered the correct speeds

in their landing speed card. All speeds (VREF, VAPP, VAC, and VFS) should have been the same.

This observation contributed to the flight examiner’s sense that the pilots had poor time

management and, as a result, forgot to enter the speeds as per operating procedures (“It comes

back to managing your time and what you actually want to achieve” [E1]).

1 For cross-referencing purposes with Table 1, participant ID is given at the end of the transcription (e.g., “B3,” where “B” refers to airline B). Square brackets (i.e., [. . .]) enclose descriptive information; chevrons (i.e., «. . .») enclose replacements for proper names of persons, cities, and airports.

Three categories of idealizations can be identified in the data: non/proficiency (pass/fail),

(non/technical) skills and knowledge (e.g., handling, decision-making, management,

communication), and states/processes (e.g., situational awareness, thinking). All of these

idealizations are mundane and uncontroversial cultural objects denoted by the language shared

within the aviation community or specific airline (e.g., Mavin & Roth, 2014). These

idealizations, which are taken to be underlying the observed performance, are not directly

accessible. Instead, the word or language used denotes a sense that arises with observations.

Flight examiners might say, for example, “I had this gut feeling that something’s not right. Whether it was

body language or something I’d seen, I’m not sure. But something didn’t sit with me” (D4). This

“gut-level sense,” which often begins with the flight examiners’ observations of pilot behavior

during the briefing preceding the simulator session, subsequently is worked out in terms of the

evidence, as B3 said in the above quotation, “We are using specific examples to cover a generic

fix.”

Mutual determination of objective and documentary sense. Flight examiners are tasked

with an assessment of pilots’ competency (proficiency) levels, holistically or, as in airlines B,

D, and E, in terms of ratings of a set of human factors. Any idealization given in the documentary sense is

based on what flight examiners actually observe, the objective sense of the situation (facts, actual

performance), which is taken to be a manifestation (document) of what by nature is

unobservable. There is therefore a reflexive relationship between concrete observations and

idealization of the underlying reality (phenomenon): the former lead to the emergence of the

latter, but the latter explains the presence of the former. In the following example recorded

during a debriefing session, the flight examiner justifies a passing grade to a worried first officer

who has had some performance problems in the past.

You maintained situational awareness and were able to make the airplane follow the correct flight

path at all times. The decisions that you had to make today were easy ones today . . . considered

all the points, and I saw a little bit of evidence of that early on in your decision to divert to

«airport 1». You said, “Okay, we need engineering and we need runway length,” which then kind

of, that’s «airport 2» out the way. And obviously «airport 1» was the closest place. So I saw clear

evidence that you were actually diagnosing the situation and making sure that you considered all

the facts that you need to consider to generate the options, which enabled you to make your


decision. (B3)

In this explanation, the maintenance of a correct flight path is evidence for the pilot’s

situational awareness, the overt consideration of requirements for a diversion is evidence for

decision-making/diagnosing, and the active selection of an appropriate alternate airport is

evidence for decision-making/option generation. The state of the derived situational awareness

becomes a master concept, which is both evidenced in observable behaviors and performances

and explains these. The factual evidence determines the examiner’s sense that the pilot has

satisfactory decision-making skills (here options and diagnosis dimensions), and the satisfactory

decision-making skills explain the observed performance. This holistic sense in turn mediates

what flight examiners are looking for, and, therefore, what they collect as data and the intentions

that they take to be expressed in the objective facts. The relationship between an evolving

documentary sense and the objective sense of the situation can be seen at work in the thoughts of

a flight examiner from an airline using the explicit, human-factors-based Model of

Assessment of Pilot Performance (MAPP):

If I see something go wrong, then I sit there myself going, “Rightyo.” I then, as you say, visualize

the MAPP and go, “Okay, well where’s this fit in it? Did they lose situational awareness? ((Points

to item on visual MAPP model)) No. Okay, well what else could it have been? Well they flew the

aircraft within tolerances ((Points to item on MAPP)). Decisions? ((Points to item on MAPP))

Yep, they decided to go to «airport». Right call.” And I start trying to cross bits off and then

narrow it down myself. So reckon it’s management of the crew. (E1)

The conceptual tool (MAPP) provides the flight examiners with a way of mapping some

observable expression, a manifestation, to a presumed underlying performance or skill

(idealization).

How Flight Examiners Evolve the Objective Sense

To arrive at conclusions about pilots’ proficiency levels, knowledge, (non/technical) skills, or

states (situational awareness, thinking), flight examiners require documentary evidence on which

to base their assessment. In the simulator, they observe and generally take some notes in real

time without time out or recourse to revisiting an event. These notes are used both for

establishing the record of the flight performance observed and for the debriefing, where they


describe to the pilots what they have done for the purpose of critique or praise. In this subsection,

we report findings concerning the process of establishing the documentary evidence for flight

examiners’ conclusions.

Flight examiners differ in which facts and how many facts they identify. The assessments

are based on documentary evidence. One might ask whether flight examiners identify the same

kind and number of facts. This is difficult to establish in the context of regular examinations but

can easily be done when, as in the present case, the same flight segment is evaluated with the

possibility to repeatedly replay the segment. Whereas there is little debate about facts once they

are articulated (e.g., “the calls were non-standard at the bottom of the approach” [A5]), the

modified think-aloud protocols that control for the assessment situations show there is variation

between the flight examiner pairs as to whether a fact is actually noted and therefore taken into

account in the assessment (Table 4). There tends to be no debate about what the standard

operating procedures say and whether a pilot action is consistent or inconsistent with these. For

example, only two of 6 flight examiner pairs noticed that the pilot in the scenario did not push

the go-around button, the first step specified for a go-around in the standard operating procedure.

Pushing this button disengages the autopilot, which otherwise, by means of the flight director, continues to direct the pilot

to continue downward in the approach rather than upward. Because this step is missing in the

kinetic sequence of the cockpit as a whole (e.g., Roth, Mavin, & Munro, 2014b), the procedure

that follows is “messy” (A3, A4, A5, B3, C1, D2, D3, D4), “untidy” (B3, D4, D6), or otherwise

deemed inappropriate. But the origin of the messy procedure is not apparent to four of the

examiner pairs. Three pairs noted that the captain in the scenario “was flying against the bars,”

that is, had a positive rate of climb whereas the command bars directed him to head down. Two

of these pairs identified the missing engagement of the go-around procedure—pushing the go-

around button—as the source of this divergence. Finally, only three pairs noted the crucial fact

that passengers were evacuated on the side of the running engine after landing with a fire on the

other engine (Table 4). That is, facts about instruments, actuators, and observable performance

constitute a baseline that is relatively undisputed.



There is considerable variation in terms of the total number of facts articulated and taken into

consideration when flight examiners articulate the evidence on which they base their assessment

decisions. In the context of assessing in simulator sessions, flight examiners take notes of what

they observe. But the extent of these notes varies widely. We therefore investigated the number

of facts holding constant the event to be assessed. Thus, in the scenario with the inappropriate

evacuation, the flight examiner pairs made salient and took into account different facts and different numbers thereof

(from 1 to 9) (Table 5). However, all those pairs who noted the evacuation on the side of the

running engine failed both captain and first officer, whereas those who did not passed both.


Flight examiners tend to be aware of the limitations of their evidence. Flight examiners

tend to be aware of the limitations of the documentary evidence that they obtain. They often find

out in the discussion with the pilots that they have missed something (e.g., “I must admit I didn’t

actually notice at the time too. It was only a bit later when I went oh, what’s going on here?”

[E1]); or report themselves having missed something (e.g., “I failed to note the point when the

autopilot was turned off” [B1]). In part, situations in which the flight examiners do not

notice important facts arise while they take notes (“I don’t see with heads down” [A3]). While

having their heads down to write down observations, they are actually missing other potentially

relevant flight-related facts. As a result, examiners find themselves in situations where their own

observations and those that pilots report differ. This is frequently made explicit in the training of new

flight examiners: “they teach us to try not to get yourself in that situation, because it’s quite, a bit

sort of, you know, he said, she said. I said, they said.”

Flight examiners noted the inherent contradiction in their task: To get the documentary

evidence that they need to support a pass/fail decision or their assessment of underlying skill

levels, they need to record their observations. But in the process of recording such notes,

they miss out on observing flight-relevant actions. The flight examiners from airline B explicitly

focus on observation while taking the scantest of notes (1–2 pages, some 15 observations). They

subsequently review their notes and what they remember in addition, pulling together all of the

information to arrive at an overall assessment as well as at assessments of categories of

performance, some of which may require special attention.


I think keeping the notes is actually the thing that’s distracting. I find myself starting to note

something down, I’ll see something else that’s happening and so I’ll stop what I’m doing, take

note of what’s happening and then I forget what I was writing down in the first place. And that’s

lost. Sometimes. That’s a bit of a pain. But you still get the overall picture. (B3)

The flight examiners in the other airlines, too, tend to take brief rather than extended notes

(up to 5 pages for a 4-hour session). These notes in themselves are insufficient as a repertoire of

facts (“I keep my notes pretty short, so if you read them they probably wouldn’t make a lot of

sense to you. But it’s just a few words to jog my memory” [D3]). Instead, these notes trigger

(episodic) memory and allow examiners to bring back what happened and those facts that they

are using in the assessment. What is important to the flight examiners is the overall picture,

which is more important than a complete tally of all facts.

The conflict is mitigated to some extent for those flight examiners who have access to a

debriefing tool. This tool records the entire simulator session and includes a videotape of pilots,

shows what pilots view, and features representations of instruments and actuators. The debriefing

tool allows flight examiners to mark simulator events for subsequent replay in the debriefing.

The process of going from observation to assessment is mirrored in the use of the debriefing tool.

Thus, a flight examiner was observed marking for replay 21 events during a 4-hour simulator

session. However, he would not actually play all of these and instead focus on four. The total

number of marked events gives him a selection to work from. In the end, as the overall picture

emerges, the flight examiner then selects those that he deems most valuable in terms of

triggering learning. In airline D, the marking process has been adapted to their performance

model (MAPP) such that the flight examiners can now mark events according to the agreed-upon

performance categories (e.g., knowledge, communication, decision-making). Even with the

debriefing tool, reviewing one or more sequences for the purpose of getting all the facts may be (but

does not have to be) prohibitive in terms of time available and returns for the investment.

Flight examiners engage in targeted evidence collection. In some airlines, records on the

preceding examination are kept. Individual flight examiners might keep their notes or remember

having assessed individual pilots repeatedly. In both types of cases, flight examiners use the

records or their memory to look for documentary evidence to support statements about whether


or not a pilot has improved: “If he’s still having problems with his engine failure after take-off

then we might have to dig a little bit deeper in to it. And it just helps us tell whether something

that you see is random or systematic” (B3).

Important here is that flight examiners and training managers want to see whether a particular

(poor) performance is recurrent rather than a one-off in the actual performance. Sometimes flight

examiners and training managers choose events such that the evidence required in support of their

documentary sense is produced. This evidence then is used to teach the pilot a particular lesson:

“We know what areas they need to improve in and so sometimes, I have to confess, I would

introduce a malfunction at a difficult time for them to handle so that you can use it as a lesson”

(B3). Across a flight simulator session, flight examiners look for multiple pieces of evidence to

support their assessment of an underlying factor. Thus, in most real examination cases and in

contrast to evaluating brief video scenarios (e.g., Roth et al., 2014a), it is not the performance in

one individual situation that determines the assessment. Instead, the flight examiners build their

case based on the overall performance during the simulator session. In the following quotation,

the flight examiner supports his rating of 3 (satisfactory) rather than a 4 (good) on the technical

skill of flying the aircraft within limits because, in one instance during a non-directional beacon

(NDB) approach, the aircraft was at the lower limit, which was taken as an indication that the

flight path management was problematic. But for the remainder of the examination, the pilot had

kept the aircraft well within the required limits.

Because we’re looking at a whole, you know, 2-hour, 3-hour session. And for example, «first

officer» got a three for flight path within limits on that exercise. Had it not been for the NDB

approach and the circling . . . he was only just fast enough. And so his flight path management for

the rest of the session was actually quite good ((i.e., rating = 4)). But that dragged it down. So it

was kind of holistic. (B2)

On rare occasions, an examination session is organized to have another flight examiner

provide an independent assessment. In such cases, the examiners use specific events to collect

evidence on the particular issues that the preceding examination/s had identified. They then

obtain the observation that makes the overall decision go one way or the other (“And as soon as I

sort of delved in to that area, it was like, right, that’s black and white” [D4]).


Flight examiners’ selection of events limits the types of facts that they anticipate to

observe. In the situation of the think-aloud protocols, flight examiners were confronted with

brief segments of flights without any knowledge of the context. The situation on the job is different

because flight examiners do not identify arbitrary facts. Instead, having programmed the events,

they have readied themselves to observe specific facts that are associated with this type of event.

Moreover, in a particular examination cycle, all pilots fly the same line-oriented flight segments

and do the same spot checks. From delays in required actions, they anticipate workload to

increase and pilots to come under time pressure, which results in a loss in the awareness of the

situation as a whole. That is, flight examiners’ perception is configured by the choice and timing

of events. It also allows them to anticipate facts related to particular human factors areas that are

more salient than others. Each event has a set of challenges, or “boxes,” and the flight examiner

observes whether or not “they ticked every box” (D3) and “how well they do” (“The session has

actually got a little bit of stop start in it . . . to tick the boxes . . . so you see some slips and errors

that you wouldn’t normally see” [C2]).

Flight examiners use repeat performance to increase the amount of documentary

evidence. The account provided so far may sound as if flight examiners do their work in an

unprincipled manner. But this is not so. To make their cases for the presence of particular levels

of proficiency, knowledge, skill, or state, flight examiners require documentary evidence. They

do not take a single instance as a case for proficiency, knowledge, skill, or state. This is especially

important to them in those cases where the observed performance requires them to make a

pass/fail decision.

At that stage I hadn’t failed him; but I hadn’t passed him either. I was sitting there thinking,

“We’ve got a 4-hour session here; we’ll see how the rest goes.” Depending on how the rest goes,

we’ll need to come back and look at that. (D4)

Repeat observations. Flight examiners build their cases as they go along, taking their

observations as evidence that stand for some performance and the level thereof, which stand for

the underlying skill. For some, this mapping occurs immediately (e.g., supported by the

conceptual model), whereas others may wait. As the session progresses “things may change”

because something else might become more important: “So I do have all those thoughts while I


am going through, but when I come out of the sim, I ask myself, ‘What is the main, the big issue

here?’” (B3). Flight examiners then make one or a series of observations that determines their

decision:

So the guy next to him started managing it. And to me that was the point then when I had had two

individual exercises that weren’t managed well. So I was like, “rightyo, we’ve got issues here.”

And by this stage I had already decided, you know, he’s not going to pass today. (D4)

In another case, a flight examiner takes the fact that the first officer moves correctly through

the list of actions stated in the standard operating procedures as evidence for the presence of the

underlying knowledge. However, these steps occurred in the same manner across situations

(“He’s actually leveled out twice now and hasn’t pulled the power levers up”). There were two

observations where the first officer had leveled out without pushing the power levers forward. In

each situation the observation is evidence pointing to a performance problem. This performance

problem occurs across situations, and, in this, is consistent with an underlying skill issue. That

concern for manipulative ability is more serious than the concern with assertiveness, for which

the flight examiner has had evidence that it can be fixed. One of the observations he has made is

good performance when the captain has been asked to fake incapacitation (by means of a

shoulder tap during the simulator exercise). In this situation, the first officer has performed well

(“stepped up . . . because he didn’t have to deal with the person next to him” [B1]).

Observations during “Repeats.” When performances are problematic, potentially pointing to

underlying problems, flight examiners ask for a repeat of a situation, segment, or exercise to

collect further documentary evidence that allows them to get “a better fix” on a factor of interest.

Repeats provide further information about the proficiency, knowledge, skill, or state

underlying the pilot’s performance. If the pilot/s perform sufficiently well during the repeated

exercise, then this provides the flight examiner with evidence that there was an issue with the

particular performance, not with the underlying dimension.

We did the exact same exercise again and he made the same mistake. And then just went through

the whole session making individual management mistakes. So someone like that, it’s actually

quite black and white. (D4)


The flight examiner provides documentary evidence for the fact that an underlying ability is

present but occluded in the performance. Thus, talking about a first officer, a flight examiner

suggested, “And that was the case with «first officer» on a couple of occasions where, for

example, with the briefing the wrong flap setting for landing and briefing the wrong speed. He

knew it, he just hadn’t realized it” (B3).

Flight examiners do not seek to ascertain the nature of evidence even when technology

affords it. With the debriefing tool, flight examiners do have the possibility to replay some event

and to ascertain the nature and number of facts (documentary evidence). But nowhere in the

present dataset does a flight examiner use or talk about using the debriefing tool to check an

observation. When an observation was checked, it was always because of discrepancies between

the flight examiner’s and a pilot’s description of what was the case.

How Flight Examiners Develop and Articulate Their Documentary Sense

The documentary method of interpretation is a common, everyday method for determining

some assumed underlying pattern that also explains the observation—e.g., for finding out what

someone thinks, for a coroner to determine the course of events that led to a death, or for a

historian to describe the worldview of an era (Garfinkel, 1967; Mannheim, 2004). But precisely

because it is an everyday method, it is so powerful: the method not only helps in making sense

but also intuitively makes sense. Flight examiners employ the documentary method of

interpretation to determine whether pilots are non/proficient (pass/fail), what their non/technical

knowledge and skills are, or the pilots’ situational awareness. All of these phenomena are not

given in themselves: these are cultural constructs that are held to manifest themselves in observables.

These constructs therefore exist only in and as documentary sense.

Viewing the same scenarios, flight examiners evolve different patterns taken to underlie

performance given in documentary sense. Previous studies suggest considerable variation in

the ratings of pilots and flight examiners asked to assess the same video (Flin et al., 2003; Mavin

et al., 2013). Such variation is also observed here in the form of different appreciations of the

proficiency or non-proficiency of a pilot. Thus, the 6 flight examiner pairs did not come to

complete agreement on the level of the performance of the six pilots they assessed in the

think-aloud part of this study: no two pairs had the same ratings across the six pilots (Table 6).


That is, even in a condition where flight examiners work in pairs such that individual subjectivity

is minimized, different conclusions are observed. What previous studies have not explained are

the reasons for such variations.


The flight examiners do not have access to the knowledge and skills underlying performance

or to a pilot’s grasp of the situation (i.e., situational awareness). Here, as in the case of whether

to pass or fail the pilot, they use the documentary method of interpretation. Because the cultural

objects are not given directly but indirectly through the manner in which they manifest

themselves and because flight examiners differ in the contents of their observations, as shown

above, the differences in flight examiners’ overall documentary sense become intelligible. The

flight examiners are most concerned with overall proficiency, which they tend to ascertain by

means of the question whether they would want themselves or their family members and friends

to be passengers on the aircraft flown by that pilot. If the response is yes, then the pilot passes; if

no, the pilot fails. This is so independent of the root causes—i.e., human factors—attributed to

non/proficiency:

And at the end of the day, I don’t actually think it matters that much what you call it, as long as

you call it something. And you can say, “Look, what I did notice during this exercise, I know you

know this stuff, but you just couldn’t recall it.” (B3)

The documentary sense begins with an indeterminate feel that articulates itself over time

into a more grounded sense. Flight examiners observe pilots over the course of a four-hour

period and then make their assessment. But their sense of how a pilot is doing emerges early,

often during the initial encounter in the briefing preceding the simulator session but certainly as

soon as the session begins:

I guess one thing I’m thinking in a 4-hour session though is I’ve got no hurry to make up my

decision. You know, you do, and of all the times in the past where I’ve had to not pass someone,

there’s always some stage during the session where I’ve gone, “No, they haven’t passed, they

haven’t failed.” And then at the end you might say, “Well you need to come up with a result.”

(D4)

The beginning tends to be a very general and generic description without much concrete (objective)


evidence (e.g., “I’ve found the FO is very introverted and he’s either intimidated by

the simulator or he’s intimidated by the process” [A1], “He’s a plodder” [B1], “They are going

reasonably well” [A3], or “They’re getting through okay” [A4]). Sometimes flight examiners

note that their sense begins with observations of pilots’ “body language” (A4, C1, D3, D5). The

overall sense of whether a pilot is proficient or not, while evolving over the entire simulator

session, may start as soon as the session begins (e.g., “So the big picture actually developed over

the whole session. It might start as soon as you walk in” [C3]). This is so because there are

training sessions preceding the actual examination. Their observations during these sessions

configure the flight examiners’ sense at the beginning of the examination session:

And we spent three days leading up to it . . . so it wasn’t just a one off sort of day. However, over

those three days I’d sort of continued to work at all these things and we were making progress.

And my thought was, well he’s going to get through. It’s not going to be a great pass, but as long

as he keeps improving he’ll be fine. (D3)

As the examination session evolves, there is an increasing fixation of the general sense,

deriving from the increasing amount of concrete evidence available that can be used in the

documentation of the case ultimately made. Oftentimes flight examiners say with hindsight that

the problem has shown up from the beginning—e.g., in the body language of the pilot.

The evolving documentary sense shapes subsequent observations. When there is some

event, it cannot be known whether the problematic performance will be recurrent. It is only with

hindsight, after having repeated events, that flight examiners will and do attribute the problem to

some inherent short-coming in the pilot, which leads to a fail rating. There therefore is path

dependence in the evolving overall sense concerning non/proficiency.

If they appear flustered, straight away, quite often at that point in time, they’ll say the wrong

thing, they’ll say, “Unscheduled feather” when it’s really a prop over speed. That sort of thing’s

quite common. And so usually if they’re going to start making mistakes it’s going to be a poor

performance, it starts happening quite early on. (B3)

Flight examiners seek further evidence (implicitly or explicitly) to confirm or disconfirm the

current documentary sense. As the modified think-aloud protocols reveal, in the attempt to locate

specific facts, flight examiners tend to find more negative evidence. In none of the 20 fail or pass


with marker cases was there an evolution from a more negative to a more positive sense. Instead,

flight examiners either began or moved to a more negative sense concerning a pilot’s

performance. Thus, the two examiner pairs where the emergency evacuation into the running

engine was noted while viewing the scenario had the definitive sense that it was a fail. One

flight examiner pair, during one of the repeated viewings, noted the running engine which led to

the reversal of their earlier sense of a good performance (pass) to a definitive fail.

By means of the documentary sense flight examiners evolve an entire explanatory

framework. Together with the general sense of overall proficiency that flight examiners evolve

based on their observations, an explanatory framework also arises. Their observations contribute

to the emergence of a sense (e.g., level of situational awareness), which then explains the fact

observed. The associated idealization (e.g., situational awareness) might then be explained by

something else that is based on evidence (e.g., [workload] management). Thus, the fact that a

pilot delayed some task might be taken as evidence that there are problems with management,

which may have high workload as its consequence, which in turn lowers situational awareness.

That is, taken together all the explanatory terms that flight examiners use in their assessment

discourse—e.g., knowledge (facts, procedures), management, communication, decision-making,

and situational awareness (Mavin & Roth, 2014)—constitute the shared explanatory framework

based on the documentary method of interpretation.

Flight examiners distinguish between underlying pattern and momentary lapse. On the job,

flight examiners distinguish between momentary lapses and underlying problems. That is, flight

examiners distinguish between actually observed performance (evidence) and the presumed

underlying pattern. Flight examiners have to find out whether a poor performance is a

manifestation of an underlying problem or the result of something else. In the following

example, the captain’s hand was moving forward in the direction of the power transfer unit switch,

which sits right below the radio magnetic indicator, itself below the airspeed indicator. The

movement itself, which can be objectively seen, is a manifestation of the intention to go for the

power transfer unit (PTU) switch. The movement stops as a radio call comes in, and the pilot

attends to the call. The flight examiner suggests, “[the captain] had a brain fart after takeoff and

forgot to turn the PTU off because a radio call came through at that exact same time his hand


was going to it” (A3). The result is a “brain fart”: the pilot, upon returning from the call, does not

complete the action ascribed to the earlier movement. The sequence becomes a manifestation,

documentary evidence, for poor management in this situation. In contrast to the think-aloud task,

where flight examiners only rated single episodes, on the job where they observe pilots over two

4-hour sessions, they tend to reason in this way:

Good people can have a bad day in the sim and still come out smelling like roses because they’ve

got good management and good communication. It might have been a lapse or something that got

them in to that situation, however, their tools in their tool bag, their good management and

communication increases their own situational awareness. (D4)

This may explain the following observation. In the think-aloud tasks, five flight examiner

pairs failed the captain and one pair gave a rating of “repeat with markers” for being confused about the turn

(left or right) following a missed approach call (Table 6 below). On the other hand, both captains

who made wrong turns during the examinations observed in this study actually passed their

examinations. Although the wrong turns were recognized as serious errors, the captains passed

because they had exhibited good performance for the remainder of the two 4-hour sessions.

Situational awareness is a master concept that flight examiners evolve by means of the

documentary method. In the scholarly literature, there is a debate whether situational awareness

really exists or whether it is part of a folk model of human factors (e.g., Dekker & Hollnagel,

2004). In the present study, situational awareness is one of three types of cultural objects derived

by means of the documentary method. It has the status of a master concept in the explanations of

pilot performance even though, and perhaps because, flight examiners find it hard to assess.

Flight examiners treat situational awareness as a state rather than as a (non/technical)

proficiency, knowledge, or skill that needs to be maintained. However, there is awareness that

this dimension cannot be measured but is something (a cultural object) that manifests itself in

concrete actions that stand in a mutually constitutive relation with the underlying state (level of

situational awareness). If pilots did not have situational awareness, they would not be able to fly

correctly; and flying correctly inherently means having situational awareness.

If they didn’t have good situational awareness (SA), they wouldn’t be able to do it. Because like

we said before, you can’t measure SA as such. You’re, you’re judging how good their situational


awareness is based on the results of their situational awareness. (B3)

This flight examiner articulates, in his words, the core of the documentary method, the

reflexive relation between evidence (results) and idealization (situational awareness). Situational

awareness cannot be assessed; in fact, it is different from the other performance aspects in that the

flight examiners treat it as a state that is affected by a range of human factors and circumstances.

They explain this state in terms of other human factors, inaccessible directly but manifesting

themselves in concrete actions and performances: “He could work on bettering knowledge,

because that would enable his management and pick up his situational awareness” (B2).

Flight examiners employ the documentary method of interpretation even when an explicit

assessment model and metric exists. The explanatory framework might be the same one that an examiner uses as part

of the assessment model that the airline uses, or it might be in terms of other human factors-

related concepts that characterize the culture-specific discourse of flight examiners (e.g.,

“automation management,” “manipulation,” or “compliance”), or these might be other concepts

that provide an explanation for a range of concrete observations (“brain fart,” “airmanship,”

“currencies”). Pass/fail decisions tend to be explained in terms of human factors-related concepts,

themselves given in the form of documentary sense:

With «pilot», I said, “I’m not comfortable now.” But when I came back to the MAPP,

“Management, ineffective organization of crew tasks” ((1 [unsatisfactory])) So it was like,

“Yep, ‘controlled self or crew members actions [though with difficulty]’”. . . And when I came

back and then transferred it on to here ((assessment metric)) suddenly I found myself down here

((1 and 2 ratings)). And . . . this is why I have to not pass you. That gut feeling is confirmed. (D4)

Here, the flight examiner used the “word pictures” of the assessment metric—“Ineffective

organization of crew tasks” = 1 (unsatisfactory); “Controlled self or crew member actions,

though with difficulties” = 2 (minimum standard)—to translate between the performance he has

seen and the ratings. When he tallied his ratings on the different human factors of the assessment

metric, he ended up with a fail according to the company policy (one 1 or three 2 ratings). The

reverse has also been described. Thus, when flight examiners are not happy with the results of the

tally of scores, which comes to a different result concerning passing or failing a pilot than what

their general sense is telling them, they describe changing individual scores to align the


content of the sense with the outcome according to the assessment categories and then use these

to explain the level of performance.
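
The tally described above can be illustrated with a minimal sketch in Python. It assumes that the company policy quoted in the text (“one 1 or three 2 ratings”) means at least one rating of 1 or at least three ratings of 2; the function name, the category names used as keys, and all ratings are hypothetical and serve only as an illustration, not as a description of any airline’s actual procedure.

```python
from collections import Counter


def tally_outcome(ratings):
    """Apply the tally rule as read from the text (an assumption):
    fail on at least one rating of 1 or at least three ratings of 2."""
    counts = Counter(ratings.values())
    if counts[1] >= 1 or counts[2] >= 3:
        return "fail"
    return "pass"


# Hypothetical ratings on a 1-5 scale for one pilot (illustrative only).
borderline = {
    "knowledge": 3,
    "management": 2,
    "communication": 2,
    "decision_making": 2,
    "situational_awareness": 3,
}

# Raising a single 2 to a 3 changes the outcome of the tally.
adjusted = dict(borderline, communication=3)

print(tally_outcome(borderline))  # fail
print(tally_outcome(adjusted))    # pass
```

The sketch shows only how sensitive such a rule is at the borderline: a change of one point on one category reverses the outcome, which is the kind of alignment between overall sense and tally that the flight examiners describe.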

Flight examiners include informal cultural objects to constitute the overall sense of

proficiency. Other, even more mundane explanations can also be observed. Thus, for example,

the flight examiner in the following excerpt relates all his observations concerning handling

techniques in different situations to the fact that the captain also does the duties of a check

captain and, therefore, is not or cannot be as current on technique as a regular line pilot, who

does much more actual flying:

There was a few little handling techniques, training captain. I think it also comes down to training

and check captain, it comes down to currency as well, a big thing for the check captains. Just

currencies, because they obviously don’t fly as much as the line guys do. (C2)

Flight examiners do know that there is uncertainty in establishing just what the observation is

evidence of. This is evident in the statement one flight examiner made: “Sometimes it’s a bit

difficult to tell whether it’s a lack of knowledge or a lack of SA [situational awareness], because

sometimes they actually know it, but they just didn’t notice it” (B3). Here, the flight examiner

suggests that the underlying skill may actually be present or of a particular level but is not

expressed in that situation.

The documentary method amounts to good story telling. Some flight examiners describe

their method in terms of story telling. A good story binds together a large number of

observations into a simple story line. In this, a good story is similar to a parsimonious scientific

theory: it is convenient and convincing. Here, the documentary method amounts to creating a big

picture by putting together the right and required pieces of evidence that create, hold

together, and make plausible the overall narrative; and it is the overall narrative that drives the

selection of the individual pieces of documentary evidence. An experienced flight examiner in

the process of training a captain to become a flight examiner described the latter’s “natural ability”

for doing this job and substantiated his assessment with an example from an informal setting in the hotel’s bar.

He can sit down and talk and discuss. It’s storytelling, he’s a storyteller. It’s obvious, last night

we were having a few beers, he’s a natural storyteller. People learn from that. And not everyone’s

got that ability. (D4)


Flight examiners’ level of experience changes the relation between facts and sense.

There are differences between beginning and experienced flight examiners with respect to the

dominance of one sense over another.

Experienced flight examiners let the documentary sense dominate the objective sense. The

overall assessment may override any other assessment, for example, derived by rigorously

implementing an assessment metric:

It would be fair to say that sometimes you’ll skew the results, I suppose. You know. If you think

somebody deserves to pass, but purely by reading the book, reading the metrics they’re going to

end up with more than two twos, then I probably make one of them a three. If I really thought

they had to pass. And conversely, um, if the metric said that the person’s passed, but I really

wasn’t happy, then I’d probably skew the results in that direction as well. (B3)

Here, too, a minor variation in marking determines whether a particular dimension of the

assessment model is a minimum standard or satisfactory, which entails a fail or a pass. There is

an awareness of the subjectivity (indeterminacy, uncertainty) involved and that what is a 2 rating

for one examiner might be a 3 rating for another. This is why the overriding sense governs

whether a pilot at the borderline (cusp) will pass or fail, and the documentary evidence will be adjusted

accordingly.

Less senior flight examiners focus on individual facts. The debriefings show that less

experienced flight examiners tend to focus on what more experienced examiners refer to as

minutiae of small errors rather than on the big picture (sense). When they have an explicit

assessment model, less senior flight examiners may entirely rely on the assessment metric—

consistent with the observations in another study that describes the assessments made by captains

and first officers to be driven by a tool whereas flight examiners are driven by their overall sense

(Roth & Mavin, 2014). Thus, beginning flight examiners, who tend to be more concerned with

individual facts, find themselves assisted in looking for the documentary evidence required for

making a particular assessment. They use the word pictures to map apparent observables (e.g.,

“Manipulated accurately, with no deviations from target parameters”) onto the associated score

(e.g., 5).

It’s a lot easier because you know, on our check forms that says ILS approach, it’s quite simple


looking at the word pictures: Did they fly it with no mistakes? Did they fly it with minimal

mistakes or did they project forward what was going to happen? Did they project backwards? Did

they think ahead? (D2)

The boundary between pass and fail becomes clearer with increasing examination

experience. There are suggestions that sensitivity between adjacent scores is low in pilot

examinations (Holt et al., 2002). The decision whether to pass or fail a pilot based on the

performance during a simulator session, though never taken lightly, becomes clearer with

increasing experience. This is analogous to natural scientists’ increasing competence in

classifying initially poorly distinguished natural objects, which comes with an increasing number

of cases that they classified in a study or over their career (Roth, 2005): “I’m more able now to

discern, even if I can see something that’s really poor, I’m still able to discern whether I’m going

to be happy at the end of the day to pass them or not” (D2). Looking back at their development,

flight examiners realize that they passed or failed a pilot in the past whom they would now rate

differently.

We’ve all had one session in the past where you think, “Now I wouldn’t have passed that, but I

did then.” And I’ve got that, I’ve got one and I know it, I can even tell you the time of night that

it happened and where we were and I can tell you the two crew. I know what they did wrong and I

know what I did wrong. (D4)

Although the overall sense of what constitutes the difference between pass and fail

ratings becomes sharper, discriminating between two adjacent scores remains difficult: “It’s not

a one is there and that box and a two is in that box. There’s like a 2.4 and a 2.5 and a 2.6” (B3).

Because of the difficulty of discriminating at that level, flight examiners tend to know that some of

them will score a pilot 2 and others 3. This description is consistent with the videotapes of the

think-aloud protocols, where the flight examiners using a rating scale for individual human

factors and components often vacillate between two adjacent scores.

Discussion

In this study, we provide evidence for and theorize the mundane way in which flight

examiners get their work done: the documentary method. Flight examiners’ inquiries concerning

pilot proficiency are based on the tight, reflexive relation between observational facts (evidence)


and mundane idealizations. In this section, we discuss the findings and then offer three

mathematical approaches that might be used to model (some aspect of) the documentary method.

Documentary Method in Pilot Assessment

This study was designed to investigate how flight examiners think at work, including line-

oriented flight examination (LOFE), operational competency assessment (OCA), or air transport

pilot license (ATPL). The results show that flight examiners use a documentary method of

interpretation to arrive at their sense of the different cultural phenomena of interest—

non/proficiency, (non/technical) knowledge and skill, or state (situational awareness, pilot

thinking). The phenomenon initially is given in vague terms, a general sense that becomes the

seed to an evolving idea of whether or not a pilot assessed is proficient or what the pilot’s level

of knowledge, skill, or state is. With an increasing number of observations, the sense tends to

become more specific as it is increasingly concretized in documentary evidence. There is a

movement from abstractness to a concretely grounded sense, while there is a parallel movement

from the concrete to the abstract, as the vague notion becomes increasingly structured and

fine-grained. Ultimately the documentary method arrives at an explanatory framework in every

practical case. That is, the result of the documentary method covers everything, which is both its

strength and its downfall, as one commentator on social psychology notes: “In any actual case it

is undiscriminating and . . . absurdly wrong” (Garfinkel, 1996, p. 18).

Some readers may be tempted to think that the documentary method is the same as classical

concept learning (Bruner, Goodnow, & Austin, 1956), where research participants derive

concepts from instances and non-instances (Figure 1). There are some significant differences,

however. In the classical case, the observations (facts) are clear, as there are only limited

numbers of attributes; and the concept can be given in an unambiguous manner, such as “two

circles or two boundaries” in the case of Figure 1. In the assessment of pilots, however, the

phenomena of interest themselves are fuzzy, as are many of the perceptual attributes (Roth &

Mavin, 2014). Flight examiners do talk about ideal performances, which correspond to the

prototype of a concept, but such a prototype need not exist in any hard way (Rosch, 1998).

Ideal performances exist only as approximations, for even when the pilots examined are flight

examiners themselves, the examiner highlighted aspects that could be improved. Unlike in


the classical concept learning paradigm, flight examiners tend to actively look for evidence of a

certain kind or introduce specific failures that condition the kinds of problems that the pilots will

face and the consequences of which the flight examiner will observe. In the documentary method

approach, a concept (e.g., situational awareness) exists in and as the totality of evidence and,

therefore, never is abstract.

««««« Insert Figure 1 about here »»»»»

Past research on pilot assessment noted the considerable variations in the scores used as part

of a measurement paradigm (e.g., Flin et al., 2003; Mavin et al., 2013; O’Connor et al., 2002).

Flight examiners easily admit that assessing pilots is not a “hard science.” This experience is

captured in the notion of documentary sense. It goes together with an objective sense associated

with flight examiners’ concrete observations that they take to be manifestations of some cultural

object (e.g., decision-making skill or situational awareness). The inter-rater reliability approach

to human factors is based on the assumption that phenomena such as pilots’ knowledge,

decision-making, management, or communication are objective phenomena that can be

measured. When raters differ, problems are ascribed to the lack of rater training, the

measurement instrument, or some other variable. In this study, we show that the cultural objects

are not themselves given. Instead, they are treated as black boxes, the contents of which only

manifest themselves in some way rather than being directly given; but not all objective

manifestations reflect what is taken to be the real underlying pattern. There is mounting evidence

that in the flight examiners’ workplace, assessment is a categorization rather than a measurement

issue (Roth & Mavin, 2014; Roth et al., 2014a). We can find here the very source of the

variations observed in previous research on pilot assessment. Flight examiners do have (and can

give) good reasons for their sense that a pilot is or is not proficient—as can be seen when they

collaborate in an assessment in the think-aloud protocol part of the present study.

This study shows that flight examiners do not and cannot perceive all relevant facts

(attributes) of an event, which mediates how they rate the performance that can be seen. In a

more extreme example, those flight examiners who did not notice that the pilots assessed

evacuated the aircraft on the side of a running engine all passed the crewmembers, but those

flight examiner pairs noticing this fact all failed the crew. Whether the failure to observe such an


important aspect would also occur during a regular simulator session cannot be ascertained by

the data available. Given the control the flight examiners have over setting up the situation and their

awareness of the flight as a whole, such cases may actually be rare—though in this study there

were instances where flight examiners had missed important aspects of the flight, such as

disabling the automatic pilot.

Material phenomena are directly available such that they can be pointed to, taken in hand, or

relatively agreed upon. Cultural objects, however, including “Galilean pulsars” (Garfinkel,

Lynch, & Livingston, 1981) or “help” (Mannheim, 2004) are assumed phenomena available only

indirectly: through their manifestations. Whether some material fact is a manifestation of an

underlying cultural phenomenon, a coincidence, or merely a contingency requires some

methodical approach. Flight examiners use a variety of methods to increase the evidence for or

against the existence of a phenomenon. Thus, for example, they select from a database of more

than 200 kinds of incidents that might affect the flight in progress. They then conduct

observations on the pilots’ performances in response to the disturbance at hand. In the end, they

produce an assessment of the pilot or, in training situations, identify a collection of different

issues that the pilots should focus on for the purpose of professional development. What the final

narrative will be is unknown at the beginning. Yet every observation possibly has a place in the

final story line, which is in part constituted by the observation. Which observations will be

included depends on the overall narrative, but the overall narrative depends on the observations

made and salient for the purpose. The narrative is an emergent one and can change from one

instant to the next in the case of a serious performance issue.

This study also reveals the reflexive relation between evidence and the phenomenon that it

supposedly manifests. That is, for example, a flight examiner’s sense that a pilot has lost or

diminished situational awareness derives from a particular observation; but this observation is

made and explained by an assumed level of situational awareness (lost, diminished). The

objective sense and the documentary sense go together and cannot be uncoupled. This is similar

to the findings of a study of classification where sociology graduate students were asked to code

hospital records for the purpose of identifying the organized ways of an outpatient clinic that led

to particular patient trajectories (Garfinkel, 1967). The study showed that the graduate students


not only assumed the knowledge that their coding procedures were to reveal but also needed such

knowledge to make decisions about what really happened in the outpatient clinic.

Studies show that experienced experimental biologists used this same method while attempting

to interpret, understand, and explain their data and the associated graphical representations

(Roth, 2014). Thus, the documentary method of interpretation differs from testing (given)

hypotheses, because the cultural object (proficiency, knowledge, skill, state) is itself a function

of the observations.

There is a temporal order when flight examiners work in the simulator, where they only have

“one shot” at making observations in any one instance. If they miss a real fact, it will not and

cannot enter the overall story line. When there is an opportunity for replay, such as with the

debriefing tool or in the case of the modified think-aloud protocol, facts may be discovered after

a first, second, or later viewing. (This was the case with the failure to push the go-around button,

which one pair of flight examiners noticed only after repeated viewing.) There is therefore an

emergent sense of what the narrative might be. After an initial assessment has been made, the

results can yet be revised, such that an initial pass (perhaps with markers) might turn into a fail.

In a small number of instances, flight examiners waited to find out from the pilots what they

had to say about a critical incident before assessing a particular event. Although rarely observed,

flight examiners do take up what they learn about some event and take it into account in their

assessment of it. This is especially so when the debriefing tool is used to replay events. What has

happened as seen in the videotape is taken as the way it was, as objective evidence that overrides what

pilots or flight examiners remember.

The Documentary Method of Pilot Assessment: Three Mathematical Models

The assessment of pilots tends to be treated as a measurement issue with the associated

question of inter-rater reliability (Flin et al., 2003). The present study shows that flight examiners

draw on the documentary method for making sense of the simulator sessions and for arriving at

an assessment and at an explanatory framework. This appears to be consistent with suggestions

that meaningful criteria for consistently assessing performance are elusive (Rigner & Dekker,

2000), and, therefore, to be consistent with the idea that assessment cannot be modeled


mathematically. However, this is not the case. We briefly present three possible approaches to

mathematically model the “fuzziness” of assessment.

Fuzzy logic. Assessment may be modeled using the fuzzy logic approach (Roth & Mavin,

2013). The assessment category is found by the minimum distance D given the fuzzy sets

specifying the lower (BL) and upper boundaries of performance (BU) and a fuzzy relation W that

specifies the weight a rater observation is given in the assessment, and a set of fuzzy observations A:

\[
D = \left[ \sum_{j=1}^{n} \left[ W_j \left( B_{L,j} - A_j \right) \right]^2 + \sum_{j=1}^{n} \left[ W_j \left( B_{U,j} - A_j \right) \right]^2 \right]^{1/2}
\]

In essence, the fuzzy logic approach maps a fuzzy set of given observations on assessment /

rating categories (e.g., pass and fail; or unsatisfactory, minimum standard, satisfactory, good,

very good). When the overall assessment is the result of rating different human factors, the same

observation may be used in one or more categories, such as situational awareness, management,

or decision-making. The same observation may therefore contribute in different ways to an

overall assessment, which may be based on an automatic failure because of low situational

awareness or because of problems in decision-making.
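
The mapping of observations onto rating categories by minimum distance can be illustrated with a minimal sketch. It assumes that each category is described by a lower and an upper boundary vector and that the category with the smallest D is selected. The function names, the two categories, and all numerical values are hypothetical and serve illustration only; they are not taken from the data of this study.

```python
import math


def fuzzy_distance(weights, lower, upper, observations):
    """Distance D of the weighted observations A from the lower (B_L)
    and upper (B_U) boundaries of one assessment category."""
    to_lower = sum((w * (bl - a)) ** 2
                   for w, bl, a in zip(weights, lower, observations))
    to_upper = sum((w * (bu - a)) ** 2
                   for w, bu, a in zip(weights, upper, observations))
    return math.sqrt(to_lower + to_upper)


def closest_category(weights, categories, observations):
    """Return the name of the category whose boundaries minimize D."""
    return min(
        categories,
        key=lambda name: fuzzy_distance(weights, *categories[name], observations),
    )


# Hypothetical boundaries (lower, upper) for three observed attributes,
# each expressed as a degree of membership between 0 and 1.
categories = {
    "fail": ([0.0, 0.0, 0.0], [0.4, 0.4, 0.4]),
    "pass": ([0.4, 0.4, 0.4], [1.0, 1.0, 1.0]),
}
weights = [1.0, 1.0, 1.0]        # assumed: all observations weigh equally
observations = [0.8, 0.7, 0.9]   # assumed fuzzy observations A

rating = closest_category(weights, categories, observations)
print(rating)
```

Changing an entry in `weights` shows how the same observation can count more or less toward the overall assessment, depending on the weight the rater gives it.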

The present study shows that the set of fuzzy observations does not just exist but establishes

itself over time while flight examiners observe. Later observations may or may not cancel the

effect of earlier observations (e.g., in “repeats”). More importantly, there is a mutually

constitutive relation between the overall (documentary) sense and the observations sought

(objective sense). Neither aspect is modeled in the fuzzy logic approach. Thus, whereas it

appears useful in mapping a given set of (fuzzy) observations onto an outcome category system,

it does not model the reasoning process and the emergence of the observations. It is a static

model that takes the outcome of a process as its input.

Catastrophe theory. Transitions between binomial situations, such as changes in attitudes

(van der Maas, Kolstein, & van der Pligt, 2003), conceptual change of scientists (Roth, 2014), or

category formation among scientists (Roth, 2005) may be described mathematically drawing on

catastrophe theory (Figure 2). In the region of the cusp, small variations in the information

parameter can lead to sudden transitions from one state to another. The model depends on only

two control variables, a normal factor α and a splitting factor β. Both in attitude transition and in


categorization, the factor α corresponded to information available; factor β was involvement and

amount of experience, respectively. The category formation case is suitable in the present

instance and is especially useful in modeling the assessments flight examiners make at the boundary

between a pass and fail rating. Van der Maas et al. (2003) provide eight flags indicative of the

suitability of the catastrophe theoretic model and different techniques to fit the catastrophe model

to the data.


This model is consistent with the results of this study that show a sharpening of the contrast

between pass and fail (Figure 2, along β axis); it is also consistent with the observation that

assessment at the boundary between two scores or pass and fail remains difficult. In the model,

minute variations in observation or circumstances can be the trigger for a transition to occur

(Figure 2, the jump from the lower to the upper surface of the cusp); or flight examiners actively

look for the tiny piece of evidence that allows them to make the pass or fail decision that reflects their overall sense.

Future research is required to test whether such models, already successfully explaining binomial

situations in other areas, are applicable to modeling assessments where flight examiners struggle

to place a performance in one of two adjacent categories.
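
The sudden transition in the region of the cusp can be illustrated with a minimal numerical sketch. It assumes the canonical cusp form, in which the state x (here read, hypothetically, as the inclination from fail toward pass) settles where α + βx − x³ = 0. The sign conventions, the parameter values, and the function names are assumptions made for illustration; the sketch is not fitted to any data.

```python
def is_bistable(alpha, beta):
    """True inside the cusp region, where x**3 - beta*x - alpha = 0
    has three real roots (two stable states and one unstable state)."""
    return 4 * beta ** 3 > 27 * alpha ** 2


def settle(x, alpha, beta, steps=5000, dt=0.01):
    """Follow dx/dt = alpha + beta*x - x**3 until the state comes to rest."""
    for _ in range(steps):
        x += dt * (alpha + beta * x - x ** 3)
    return x


def sweep(alphas, beta, x0):
    """Track the state while the normal factor alpha changes step by step."""
    states, x = [], x0
    for alpha in alphas:
        x = settle(x, alpha, beta)
        states.append(x)
    return states


def largest_jump(states):
    """Largest change of state between two neighboring alpha values."""
    return max(abs(b - a) for a, b in zip(states, states[1:]))


alphas = [i / 10 for i in range(-10, 11)]        # information from -1 to +1
low_split = sweep(alphas, beta=-1.0, x0=-1.0)    # assumed low splitting factor
high_split = sweep(alphas, beta=1.5, x0=-1.0)    # assumed high splitting factor

print(round(largest_jump(low_split), 2), round(largest_jump(high_split), 2))
```

With the low splitting factor the state changes smoothly with α. With the high splitting factor it stays on the lower surface until the edge of the bistable region is crossed and then jumps to the upper surface in a single step.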

Constraint satisfaction. Interpretation formation and classification may be modeled using

constraint satisfaction models, as one study showed in the case of navigation of navy vessels

(Hutchins, 1995). Here, different underlying attributes are modeled in terms of nodes that represent

the current hypothesis concerning the attribute (Figure 3a). When information becomes

available, it feeds into the respective hypothesis, which has an activation level between 0 and 1.

There are sets of attributes supporting/reinforcing each other (+), whereas other pairs of

attributes counteract (–). In the model, one set of attributes supports a pass decision (the

activation levels of the 6 nodes are {1, 1, 1, 0, 0, 0}), whereas another set supports a fail decision

(the activation levels of the 6 nodes are {0, 0, 0, 1, 1, 1}). At any one point, the distance of the network

from the two extreme cases can be calculated (e.g., using a Euclidean metric in the vector space).

The assessment or interpretation formation trajectories can then be graphed (Figure 3b), as

shown in models of how artifact design processes evolve (Roth, 2001). In Figure 3b, all four

trajectories shown begin with the same initial state: two leading to a pass decision, one to fail


rating, and one process remains undecided between the two as no new information was provided at that

point. This latter may be taken as representing the pass with marker ratings, which flight

examiners used when a performance was not a clear pass rating but not insufficient enough to

warrant a fail. The network model corresponds to the observation that the overall state of the

documentary sense (current interpretation, inclination) is a function of the parts (information

about different attributes) but each attribute is a function of the documentary sense. This type of

model therefore most clearly represents the evolving nature of the process of an assessment.

Future research would be required to test the fit of the constraint satisfaction model with concrete

data from pilot assessment.
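
A network of this kind can be sketched in a few lines. The sketch assumes six nodes in two groups of three (nodes supporting pass and nodes supporting fail), positive links within a group, negative links between groups, and a simple interactive-activation update rule. The link strengths, the update rate, the starting activation of 0.5, and the evidence values are hypothetical choices for illustration, not parameters estimated in this study.

```python
import math

PASS = [1, 1, 1, 0, 0, 0]   # activation pattern of a clear pass
FAIL = [0, 0, 0, 1, 1, 1]   # activation pattern of a clear fail


def make_weights(n=6, support=0.1, conflict=-0.1):
    """Nodes 0-2 reinforce each other, as do nodes 3-5; the two groups
    counteract each other. The link strengths are assumed values."""
    return [[0.0 if i == j else (support if (i < 3) == (j < 3) else conflict)
             for j in range(n)] for i in range(n)]


def step(act, weights, evidence, rate=0.2):
    """One update: positive net input pushes a node toward 1,
    negative net input pushes it toward 0."""
    new = []
    for i, a in enumerate(act):
        net = evidence[i] + sum(w * b for w, b in zip(weights[i], act))
        a += rate * net * ((1 - a) if net > 0 else a)
        new.append(min(1.0, max(0.0, a)))
    return new


def distance(act, target):
    """Euclidean distance of the network state from an extreme case."""
    return math.sqrt(sum((a - t) ** 2 for a, t in zip(act, target)))


def run(evidence, steps=200):
    """Return the trajectory of (distance to pass, distance to fail)."""
    act, weights, trajectory = [0.5] * 6, make_weights(), []
    for _ in range(steps):
        act = step(act, weights, evidence)
        trajectory.append((distance(act, PASS), distance(act, FAIL)))
    return trajectory


toward_pass = run([0.2, 0, 0, 0, 0, 0])   # evidence feeds one pass attribute
toward_fail = run([0, 0, 0, 0.2, 0, 0])   # evidence feeds one fail attribute
undecided = run([0, 0, 0, 0, 0, 0])       # no new information

for name, traj in [("pass", toward_pass), ("fail", toward_fail), ("none", undecided)]:
    print(name, [round(d, 2) for d in traj[-1]])
```

Plotting the two distances of each trajectory against each other gives a graph of the kind described for Figure 3b. In this sketch, a single piece of evidence is enough to pull the whole network toward one extreme, whereas the run without evidence stays equidistant from both over the simulated horizon.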


Implications

Research already showed that flight examiner assessment is based on and can be modeled by

means of fuzzy concepts and fuzzy observations (Roth & Mavin, 2014). The present

investigation extends past research by showing that flight examiners use a documentary method

of interpretation that evolves a relation between an overall sense and concrete, sometimes exact

(e.g., current speed, torque, or presence/absence of a procedural call) and sometimes more fuzzy

observations (e.g., whether a captain is leaning on the first officer or whether they have an open

discussion). The emerging sense both is a function of and drives factual observation. This study

therefore allows us to anticipate variations between the assessment results of flight examiners,

who nevertheless have and can provide good reasons, as seen in the think-aloud protocol and

stimulated recall parts of this study. There is variation even though these flight examiners go

“systematically” about their evaluations. These results therefore provide an explanation for the

disappointing results of a study where, after three years of training, the authors conclude that “there

were typically several I/Es who gave noticeably different distributions of ratings,” “[s]ystematic

differences among raters were typically found,” “it may be difficult to achieve [consistency in

the .70s],” “agreement for specific items is often inadequate,” and “sensitivity levels were quite

low across the 3 years” (Holt et al., 2002, pp. 324–325). There is therefore mounting evidence

for the hypothesis that the variation is true, which limits the effect of any training effort that attempts to

increase rater calibration.


The true purpose of pilot assessment is the overall improvement of safety in the industry

rather than a school-like grade for each pilot. Thus, rather than falling into despair over

irremediable rater variance, we might ask how the observed variations or their underlying causes

might be used positively to improve safety in aviation. In the context of the participant airlines,

this research team has begun to work with flight examiners to change the practice of debriefing,

giving more space to the reflections of pilots on their own practices, such that the focus of the

biannual two-day session is on the learning with a decreased emphasis on the grading while

maintaining the identification of problem areas.

In the documentary method, the presumed underlying patterns (sense) are based on the

observations (evidence), which are in turn explained by the patterns. Practitioners might be

interested in focusing on increasing the number of observations, thereby increasing the number of

pieces of evidence substantiating the sense flight examiners have concerning proficiency,

knowledge and skills, or state. This is different from the behavioral marker approach to

assessment, where markers are rated on a numerical scale (e.g., Flin & Martin, 2001). Flight

examiner assessment would be based on observable evidence rather than on ratings of

overarching but inaccessible factors. Diagnostic tools such as the Enhancing Performance With

Improved Coordination (EPIC) tool (Deaton et al., 2007), which alert instructors to specific facts

easily synthesized from simulators, may turn out to assist flight examiners in collecting more

evidence than they have done in the past. As a result, the approach would reflect the increasing

tendency of the industry to ground decisions in solid evidence. But the increase in the amount of

evidence should be balanced by efforts to get the big picture, which amounts to conceptualizing

performance and telling a parsimonious and coherent story concerning the proficiencies of pilots.

References

Alberdi, E., Sleeman, D. H., & Korpi, M. (2000). Accommodating surprise in taxonomic tasks:

The role of expertise. Cognitive Science, 24, 53–91.

Bohnsack, R., Pfaff, N., & Weller, W. (Eds.). (2010). Qualitative analysis and documentary

method in international educational research. Leverkusen, Germany: Barbara Budrich

Publishers.


Boshuizen, H. P. A., & Schmidt, H. G. (1992). On the role of biomedical knowledge in clinical

reasoning by experts, intermediates and novices. Cognitive Science, 16, 153–184.

Bruner, J. S., Goodnow, J., & Austin, G. A. (1956). A study of thinking. New York, NY: Wiley.

Civil Aviation Authority of New Zealand (CAA-NZ). (2013, February). Flight test standards

guide: Airline flight examiner rating. Accessed August 20, 2014 at

http://www.caa.govt.nz/pilots/Instructors/FTSG_Airline_Flt_Examiner.pdf

Deaton, J. E., Bell, N., Fowlkes, J., Bowers, C., Jentsch, F., & Bell, M. A. (2007). Enhancing

team training and performance with automated performance assessment tools. International

Journal of Aviation Psychology, 17, 317–331.

Dekker, S., & Hollnagel, E. (2004). Human factors and folk models. Cognition, Technology and

Work, 6, 79–86.

Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data (Rev. ed.).

Cambridge, MA: MIT Press.

Flin, R., & Martin, L. (2001). Behavioral markers for crew resource management: A review of

current practice. International Journal of Aviation Psychology, 11, 95–118.

Flin, R., Martin, L., Goeters, K., Hörmann, H., Amalberti, R., Valot, C., & Nijhuis, H. (2003).

Development of the NOTECHS (non-technical skills) system for assessing pilots' skills.

Human Factors and Aerospace Safety, 3, 97–119.

Garfinkel, H. (1967). Studies in ethnomethodology. Englewood Cliffs, NJ: Prentice-Hall.

Garfinkel, H. (1996). Ethnomethodology’s program. Social Psychology Quarterly, 59, 5–21.

Garfinkel, H., Lynch, M., & Livingston, E. (1981). The work of a discovering science construed

with materials from the optically discovered pulsar. Philosophy of the Social Sciences, 11,

131–158.

Holt, R. W., Hansberger, J. T., & Boehm-Davis, D. A. (2002). Improving rater calibration in

aviation: A case study. International Journal of Aviation Psychology, 12, 305–330.

Hutchins, E. (1995). Cognition in the wild. Cambridge, MA: MIT Press.

Johnston, A. N., Rushby, N., & Maclain, I. (2000). An assistant for crew performance

assessment. International Journal of Aviation Psychology, 10, 99–108.


Jordan, B., & Henderson, A. (1995). Interaction analysis: Foundations and practice. Journal of

the Learning Sciences, 4, 39–103.

Mannheim, K. (2004). Beiträge zur Theorie der Weltanschauungs-Interpretation [Contributions

to the theory of worldview interpretation]. In J. Strübing & B. Schnettler (Eds.),

Methodologie interpretativer Sozialforschung: Klassische Grundlagentexte (pp. 103–153).

Konstanz, Germany: UVK.

Mavin, T. J., & Roth, W.-M. (2014). A holistic view of cockpit performance: An analysis of the

assessment discourse of flight examiners. International Journal of Aviation Psychology, 24,

210–227.

Mavin, T. J., Roth, W.-M., & Dekker, S. W. A. (2013). Understanding variance in pilot

performance ratings: Two studies of flight examiners, captains and first officers assessing the

performance of peers. Aviation Psychology and Applied Human Factors, 3, 53–62.

O’Connor, P., Hörmann, H. J., Flin, R., Lodge, M., & Goeters, K.-M. (2002). Developing a

method for evaluating crew resource management skills: A European perspective.

International Journal of Aviation Psychology, 12, 263–285.

Perez, R. S., Johnson, J. F., & Emery, C. D. (1995). Instructional design expertise: A cognitive

model of design. Instructional Science, 23, 321–349.

Pollner, M. (1987). Mundane reason: Reality in everyday and sociological discourse.

Cambridge, UK: Cambridge University Press.

Rigner, J., & Dekker, S. W. A. (2000). Sharing the burden of flight deck automation training.

International Journal of Aviation Psychology, 10, 317–326.

Rosch, E. (1998). Principles of categorization. In G. Mather, F. Verstraten, & S. Anstis (Eds.),

The motion aftereffect (pp. 251–270). Cambridge, MA: MIT Press.

Roth, W.-M. (2001). Designing as distributed process. Learning and Instruction, 11, 211–239.

Roth, W.-M. (2005). Making classifications (at) work: Ordering practices in science. Social

Studies of Science, 35, 581–621.

Roth, W.-M. (2014). Graphing and uncertainty in the discovery sciences: With implications for

STEM education. Dordrecht, The Netherlands: Springer.


Roth, W.-M., & Mavin, T. J. (2013). Assessment of non-technical skills: From measurement to

categorization modeled by fuzzy logic. Aviation Psychology and Applied Human Factors, 3,

73–82.

Roth, W.-M., & Mavin, T. J. (2014). Peer assessment of aviation performance: Inconsistent for

good reasons. Cognitive Science. DOI: 10.1111/cogs.12152

Roth, W.-M., Mavin, T. J., & Munro, I. (2014a). Good reasons for high variance (low interrater

reliability) in performance assessment: A case study from aviation. International Journal of

Industrial Ergonomics, 44, 685–696.

Roth, W.-M., Mavin, T., & Munro, I. (2014b). How a cockpit forgets speeds (and speed-related

events): toward a kinetic description of joint cognitive systems. Cognition, Technology and

Work. DOI: 10.1007/s10111-014-0292-0

Suchman, L. (2007). Human-machine reconfigurations: Plans and situated actions. Cambridge,

UK: Cambridge University Press.

Suto, I. (2012). A critical review of some qualitative research methods used to explore rater

cognition. Educational Measurement: Issues and Practice, 31, 21–30.

Transport Canada. (2013, January). Pilot examiner manual (4th ed.). Ottawa, Canada: Minister

of Transport. Accessed August 22, 2014 at

http://www.tc.gc.ca/publications/en/tp14277/pdf/hr/tp14277e.pdf

van der Maas, H. L. J., Kolstein, R., & van der Pligt, J. (2003). Sudden transitions in attitudes.

Sociological Methods & Research, 32, 125–152.

Wineburg, S. (1998). Reading Abraham Lincoln: An expert/expert study in the interpretation of

historical texts. Cognitive Science, 22, 319–346.
