Page 1
1
Constructing faces from memory: the impact of image likeness and prototypical
representations
Charlie D. Frowd (1*)
David White (2)
Richard I. Kemp (2)
Rob Jenkins (3)
Kamran Nawaz (4)
Kate Herold (4)
(1) Department of Psychology, University of Winchester, Winchester SO22 4NR, UK
(2) School of Psychology, University of New South Wales 2052 Australia
(3) Department of Psychology, University of Glasgow G12 8QQ UK
(4) School of Psychology, University of Central Lancashire PR1 2HE UK
* Corresponding author: Charlie Frowd, Department of Psychology, University of
Winchester, Winchester SO22 4NR, UK. Email: [email protected] .
Phone: (01962) 624943.
Running head: Pictorial influences on face construction
Research suggests that memory for unfamiliar faces is pictorial in nature, with
recognition negatively affected by changes to image-specific information such as head
pose, lighting and facial expression. Further, within-person variation causes some
images to resemble a subject more than others. Here, we explored the impact of
target-image choice on face construction using a modern evolving type of composite
system, EvoFIT. Participants saw an unfamiliar target identity and then created a
single composite of it the following day with EvoFIT by repeatedly selecting from
arrays of faces with ‘breeding’, to ‘evolve’ a face. Targets were images that had been
previously categorised as low, medium or high likeness, or a face prototype
comprising averaged photographs of the same individual. Identification of
composites of low likeness targets was inferior but increased as a significant linear
trend from low to medium to high likeness. Also, identification scores decreased
when targets changed by pose and expression, but not by lighting. Similarly,
composite identification from prototypes was more accurate than that from low
likeness targets, providing some support that image averages generally produce more
robust memory traces. The results emphasise the potential importance of matching a
target’s pose and expression at face construction, and of obtaining image-specific
information for construction of facial-composite images, a result that would appear to
be useful to developers and researchers of composite software.
(224 words.)
Originality: This current project is the first of its kind to formally explore the
potential impact of pictorial properties of a target face on identifiability of faces
created from memory. The design followed forensic practices as far as is practicable,
to allow good generalisation of results.
Witnesses and victims of crime often work with a forensic practitioner to
produce a likeness of an offender's face from memory. The resulting image is used by
law enforcement to generate new lines of enquiry in the hope of identifying the
offender. There are two contrasting types of software system employed to recover
these so-called ‘composite’ images from memory: observers construct a face by
selecting individual facial features (eyes, nose, hair, mouth, etc.) or they select
whole faces from arrays of alternatives, with ‘breeding’, to ‘evolve’ a face (for a
review of computerised and non-computerised methods, see Frowd, Carson, Ness,
Richardson et al., 2005). Considerable research effort has been carried out over the
past four decades to understand these methods, identify their strengths and
weaknesses, and make improvements (for a review, see Frowd, 2012).
To test the effectiveness of composite-construction systems, a design is
normally used that simulates the applied context. Participants inspect a photograph of
a target person who is unfamiliar to them, and construct a composite of that person’s
face. Subsequently, another group of participants who are familiar with the target
attempt to recognise the constructed composite (e.g. Brace, Pike & Kemp, 2000;
Frowd, Carson, Ness, McQuiston et al., 2005; Valentine et al., 2010).
Image choice is an important consideration for memory research using
unfamiliar-face stimuli because superficial differences between target and
test images can have a significant effect on memory accuracy (e.g. Bruce, 1982).
When creating composites of target faces, the salience of these superficial pictorial
properties, resulting from transient environmental variables, is likely to reduce their
effectiveness. For example, Figure 1B shows a selection of photographs sourced from
Google Image. These images vary with respect to a range of environmental factors
including head pose, expression, head angle, lighting, camera-to-subject distance and
lens characteristics. For images of unfamiliar faces, even modest changes along such
dimensions can cause substantial error in both memory for unfamiliar faces (e.g.
Bruce, 1982; Davies & Milne, 1982; Longmore, Liu & Young, 2008; Valentine &
Bruce, 1988) and matching tasks where images are simultaneously presented (e.g.
Bruce et al., 1999; Jenkins, White, Van Montfort & Burton, 2011). Further, since
most approaches for creating composites produce images that are standardized with
regard to pose, expression and lighting, the face-construction task is likely to be at
odds with the image-specific nature of unfamiliar face memory.
Whilst memory for unfamiliar faces is sensitive to image-specific variation,
recognition of familiar faces appears to be largely unaffected by these variables. For
example, familiar faces can be recognised very accurately even when image quality is
very poor (Burton, Wilson, Cowan & Bruce, 1999). Similarly, when facial
composites are constructed by participants who have prior familiarity with the target
identities, subsequent recognition of these composites is improved dramatically
relative to composites constructed of unfamiliar faces (Davies et al., 2000; Frowd,
Skelton et al., 2011). Thus, the low levels of composite identification accuracy
reported in the literature may be due in part to the inadequacy of face constructors’
memory representations rather than limitations in the process of memory construction.
In the current study, we examined the contribution of face representation to
the quality of constructed likenesses from memory: we exposed face constructors to
either photographs of faces or to prototypes derived from multiple images of an
individual’s face. The face prototypes were generated according to a procedure
described by Burton, Jenkins, Hancock and White (2005), who modelled the process
of face familiarisation as a cumulative refinement of memory representations driven
by a simple image averaging process. Indeed, by calculating the average values of
correspondent pixels across a range of photographs of the same person, the
researchers produced average images that were recognised more easily by both
humans and computer algorithms than the individual exemplars used to create these
averages (Burton et al., 2005; Jenkins & Burton, 2008). This result indicates that
average images provide a more stable representation of identity than individual
photographs, by strengthening features that are consistent across images whilst
softening the contribution of uncorrelated features. The resultant representation has
reduced low-level image variation and tends to be neutral for the various image-
specific factors listed above that are known to impede unfamiliar face processing.
Our expectation in exploring this more theoretical (less applied) issue was that facial
composites based on a memory of a face prototype (a representation based on the
average of several photographs) will be more accurately recognised than composites
based on any one of the photographs making up the average. This is because the
nature of a particular image of the target (an individual instance) will be influenced
both by the characteristic aspects of the target’s appearance and by image-specific
factors such as lighting, distance, perspective and properties of the camera. When
people are unfamiliar with a target face, they are unable to reliably separate this
variance from identity-specific information in the image (e.g. Hill & Bruce, 1996;
Jenkins et al., 2011; Liu & Ward, 2006). The outcome is that people are likely to
create a composite based on image-specific details, information which does not
provide useful cues to identity, and so compromise composite identification.
We also explored the effect of using different instances of the same person on
face construction. In forensic construction, witnesses and victims create a composite
from a specific memory—an instance of the face—and so an understanding of the
importance of pictorial information is worthwhile forensically. More specifically,
photographs of people’s faces were used to represent specific instances. Some
photographs of faces are recognised more successfully than others (e.g. Carbon,
2008), and because they vary in the degree to which they are perceived to resemble
the person (Jenkins et al., 2011), it is likely that different instances will produce
greater- or lesser-quality composites. Therefore, evaluating the impact of target-
image choice on facial-composite construction provides important information for
researchers who attempt to understand and improve the effectiveness of composite
images and systems.
EXPERIMENT
The experiment was carried out in four stages: selection of target images
(Stage 1), construction of composites (Stage 2), naming of composites (Stage 3) and
ratings of composites’ pictorial match with the target (Stage 4), as described below.
Stage 1: Selection of target images
Our aim was to use face-construction procedures in the laboratory that were
similar to the applied context (e.g. Frowd, Carson, Ness, McQuiston et al., 2005). We
therefore chose target identities that were unfamiliar to participants who would
construct the composites, but thereafter familiar to the judges who would attempt to
name the composites. This aim was facilitated using celebrity targets that were
familiar to participants living in Australia but were largely unknown in the UK.
We were interested in collecting a good range of variation in individual
instances for a number of identities, and so collected 12 photographs of each of 40
Australian celebrities. Australian participants who were familiar with them were
asked to provide likeness ratings indicating the extent to which each photograph looked like the
relevant person. From each set of 12 photographs belonging to the same identity, we
produced an average (prototype), and then selected three photographs of each
celebrity representing poor, medium, and good likeness (see Materials for more
details).
It was anticipated that composites produced from good-likeness images would
be recognised better than composites produced from poor likenesses, with medium-
likeness targets producing intermediate-quality composites. In addition, because the
averaging process has the effect of removing image-specific variance, it was
hypothesised that composites based on prototype targets would be more recognisable
than composites based on images from the three likeness categories.
Method
Participants
Twelve (nine female) undergraduate and post-graduate students from the
University of New South Wales (Australia) volunteered to participate in Stage 1.
Participants’ age ranged from 19 to 27 years (M = 23.7 years, SD = 2.1 years), and
they participated in exchange for course credit or a small cash incentive.
Materials
For each of 40 Australian celebrities (20 males and 20 females), 12 images
were downloaded from the Internet (480 images in total). The images were collected
via the Google Image search engine using celebrities’ names as search terms and so
varied in terms of image quality, lighting, background, head pose and facial
expression. We accepted the first 12 colour images of each face that: i) exceeded 150
pixels in height, ii) had a somewhat frontal aspect and iii) were free from occlusions.
An image average (prototype) was created for each celebrity using the procedure
described by Burton et al. (2005). This involved morphing each celebrity photograph
to a standard shape template using in-house software to align facial features across the
set. Mean values were then calculated for corresponding pixels before the resultant
‘shape-free’ image was morphed to the average shape for that celebrity. Because this
averaging process generates images that are cropped around the internal features of
the face (excluding ears, hair and face-shape), we also removed the external features
of all 480 photographs in the same way. Example stimuli are shown in Figure 1.
Figure 1. Image A is an example of a face prototype (image average) of the current
Australian Prime Minister, and B is a selection of the photographs that were used to
create this image (12 were used in our experiment). Details of the procedure used to
create the prototype can be found in Burton et al. (2005); in the experiment colour
images were used. For reasons of copyright, we are not able to show the target
photographs used in our experiment.
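The pixel-averaging step described in Materials can be illustrated with a minimal sketch. It assumes that each photograph has already been morphed to the standard shape template, so that corresponding pixels refer to the same facial location; the morphing itself and the final warp to the celebrity's average shape are not shown. The function name `average_images` and the tiny greyscale arrays are hypothetical and serve only as illustration; they are not the in-house software used in the study.

```python
from typing import List

Image = List[List[float]]  # greyscale image stored as rows of pixel intensities


def average_images(images: List[Image]) -> Image:
    """Return the pixel-wise mean of equally sized, pre-aligned images."""
    if not images:
        raise ValueError("at least one image is required")
    height, width = len(images[0]), len(images[0][0])
    for img in images:
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError("all images must share the same dimensions")
    n = len(images)
    # Average corresponding pixels across the whole set of images.
    return [
        [sum(img[y][x] for img in images) / n for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 2 x 2 'shape-free' images of one identity (values are made up).
aligned = [
    [[100.0, 120.0], [90.0, 60.0]],
    [[110.0, 100.0], [70.0, 80.0]],
    [[90.0, 110.0], [80.0, 70.0]],
]
prototype = average_images(aligned)
```

Features that are consistent across the aligned images survive the averaging, whereas values that vary from image to image tend towards the mean, which is the property the averaging procedure relies upon.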
Design and Procedure
Participants were tested individually. The 480 photographs and the 40 image
prototypes were presented blocked by identity, in an attempt to avoid making the task
too disjointed (which may have occurred if identity was randomised across the set),
and participants saw the 12 photographs and the prototype of each celebrity
sequentially. They were given a different random block order, and image order
within the block was also random for each person. The celebrity’s name was
displayed below each image to avoid ambiguity regarding identity. For each image,
participants were asked to provide a likeness rating using an on-screen scrollbar
labelled at end-points, "nothing like them" (rating value of 1) and "perfect likeness"
(100). Images were presented centrally to dimensions of 6.5 cm wide by 9.5 cm high.
If a participant was not familiar with a particular celebrity this was indicated by
clicking on a button labelled ‘unfamiliar’, and image presentation resumed from the
start of the next block. The task was self-paced and each image remained visible until
a response was made. Testing sessions lasted for approximately 30 minutes per
participant.
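The blocked randomisation described above can be sketched as follows. This is an illustration only: the stimulus names are hypothetical, and the use of a seeded generator per participant is an assumption rather than a detail reported for the testing software.

```python
import random
from typing import Dict, List


def presentation_order(images_by_identity: Dict[str, List[str]],
                       rng: random.Random) -> List[List[str]]:
    """Shuffle block (identity) order, then image order within each block."""
    identities = list(images_by_identity)
    rng.shuffle(identities)
    blocks = []
    for identity in identities:
        block = list(images_by_identity[identity])  # copy, so the input is untouched
        rng.shuffle(block)
        blocks.append(block)
    return blocks


# Hypothetical names: 12 photographs plus one prototype for each of 40 identities.
stimuli = {
    f"celebrity_{i:02d}": [f"celebrity_{i:02d}_photo_{j:02d}" for j in range(1, 13)]
    + [f"celebrity_{i:02d}_prototype"]
    for i in range(1, 41)
}
order = presentation_order(stimuli, random.Random(1))  # e.g. one seed per participant
```

Each participant would thus see all 13 images of one identity before moving to the next, with both levels of ordering differing between participants.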
Results
Mean likeness ratings were calculated for individual photographs and for
prototypes. As most composites created in police investigations are of male faces, the focus
here was on male targets (with female targets set aside for other projects). Also,
as having generally identifiable targets was important for composite naming in Stage
3, six identities were excluded that were not well known: in this case, those who
were identified as familiar by fewer than 65% of participants. Two further photographs
had distinctive facial hair and so images from both of these celebrities were omitted,
to avoid producing composites that would have been too unusual in this respect.
Based on a G*Power analysis (Faul et al., 2007), we estimated that eight
targets would be sufficient to detect a practically useful, large effect size (f = .4) with very good
power (1-β = .9) for the planned by-items analysis in the composite evaluation
(naming) stage (parameter settings: α = .05, Repetitions = 4, Groups = 1, r = .7, ε =
1.0). Thus, eight identities were selected at random from the remaining 12 male faces
(see Appendix). The mean rated likeness of photographs (instances) of the selected
faces ranged from 33 to 94 (M = 64.7, SD = 14.3). Skew (-0.05) and kurtosis (-0.70)
were within the expected range for a Normal distribution.
For each celebrity, we selected instances corresponding to the lowest and the
highest mean-rated likeness, as well as the photograph that was closest to the mean
likeness for the relevant celebrity. We refer to these photographs as having low, high
and medium likeness, respectively. Mean ratings of selected photographs were 43.5
(SD = 6.2) for low, 67.2 (SD = 4.8) for medium and 85.2 (SD = 6.6) for high likeness;
for prototype images, the mean was 83.9 (SD = 6.8). For a Repeated-Measures Analysis of
Variance (ANOVA) on these by-items likeness ratings, Mauchly's Test of Sphericity
was significant [Mauchly's W(5) = 0.06, ε = .49, p = .007], indicating
unequal variances of the differences between categories; here, ratings are relatively less
variable in the medium category (which is not a problem in itself). Degrees of
freedom were adjusted using the Greenhouse-Geisser correction, and the ANOVA
was significant for target type [F(1.5,10.3) = 104.0, p < .001, ω2 = .89].
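The rule used to select the low, medium and high likeness instances for each celebrity can be sketched as below. The photograph labels and rating values are hypothetical; only the selection rule (lowest mean rating, highest mean rating, and the rating closest to the celebrity's own mean) is taken from the text.

```python
from statistics import mean
from typing import Dict


def select_instances(mean_ratings: Dict[str, float]) -> Dict[str, str]:
    """Pick the low, medium and high likeness photographs for one celebrity."""
    overall = mean(mean_ratings.values())
    low = min(mean_ratings, key=mean_ratings.get)
    high = max(mean_ratings, key=mean_ratings.get)
    # Medium: the photograph whose mean rating lies closest to the celebrity's mean.
    medium = min(mean_ratings, key=lambda k: abs(mean_ratings[k] - overall))
    return {"low": low, "medium": medium, "high": high}


# Hypothetical mean likeness ratings (scale 1 to 100) for five photographs.
ratings = {
    "photo_01": 41.0,
    "photo_02": 66.0,
    "photo_03": 88.0,
    "photo_04": 59.0,
    "photo_05": 73.0,
}
selected = select_instances(ratings)
```

With these made-up values the celebrity's mean rating is 65.4, so `photo_02` is chosen as the medium-likeness instance.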
Repeated contrasts of the ANOVA confirmed that likeness ratings
of photographs were higher both for medium than for low [t(7) = 25.1, p < .001, dc
= 3.7; see footnote 1], and for high than for medium [t(7) = 8.7, p < .001, dc = 3.0]. There was
equivalence between prototype and high [t(7) = 0.3, p = .75], which was expected,
since research has found a null effect between equivalent categories for images
averaged in the same way (Bruce, Ness, Hancock, Newman & Rarity, 2002). Two
further contrasts were conducted, with Bonferroni correction applied (α = .05/2 =
.025), which indicated that prototypes were rated higher than both medium [t(7) = 5.8,
p < .001, dc = 2.8] and low categories [t(7) = 11.6, p < .001, dc = 6.2].
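The dc values reported for these contrasts are described in footnote 1 as following Equation 3 of Dunlap et al. (1996). As we understand that equation, the effect size for a correlated contrast is d = t × sqrt(2(1 − r) / n), where r is the correlation between the paired scores and n is the number of pairs. The sketch below is illustrative only: the correlation is not reported in the text, so the value of r used here is hypothetical, and the function name is our own.

```python
import math


def dunlap_d(t_correlated: float, r: float, n: int) -> float:
    """Effect size for a correlated contrast: d = t * sqrt(2 * (1 - r) / n)."""
    if not -1.0 <= r < 1.0:
        raise ValueError("r must lie in [-1, 1)")
    if n < 2:
        raise ValueError("n must be at least 2")
    return t_correlated * math.sqrt(2.0 * (1.0 - r) / n)


# t and n are taken from the high versus medium contrast, t(7) = 8.7 with
# eight items; r = 0.5 is a hypothetical correlation, not a reported value.
d_moderate_r = dunlap_d(8.7, r=0.5, n=8)
d_high_r = dunlap_d(8.7, r=0.9, n=8)
```

For a fixed t, the estimate shrinks as the correlation between paired scores increases, which is why ignoring the correlation tends to over-estimate the effect size.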
Stage 2: Composite construction
Composites were constructed using EvoFIT computer software. The
underpinnings of this system were first described in prototype form in Hancock
(2000), and EvoFIT has now been the focus of extensive research and development.
For a detailed description of the main technical aspects, refer to Frowd, Carson and
Hancock (2004), while a summary of milestones in development, which particularly
relate to the psychology of face construction, can be found in Frowd (2012) or Frowd,
Skelton, Atherton and Hancock (2012).
EvoFIT is one of two main commercial implementations that create a
composite based on the natural processes of selection and breeding: the other system
is EFIT-V (e.g. Valentine et al., 2010). With EvoFIT, face constructors repeatedly
select faces from arrays of complete faces, with ‘breeding’, to ‘evolve’ a composite.
[Footnote 1: To avoid over-estimating the standard effect size (Cohen’s d) for correlated contrasts, dc is calculated using Equation 3 of Dunlap, Cortina, Vaslow and Burke (1996).]
They first select items which resemble the target in terms of facial ‘shape’,
specifically shape and placement of individual features, and then in terms of facial
‘texture’, greyscale colouring of eyes, brows and overall appearance of the skin.
Selected choices are combined (using genetic cross-over and mutation operations) and
the process is repeated. Once a face has been evolved, software tools are used to
improve the perceived match to the target for age, weight, masculinity and other
overall properties; constructors may also manipulate the shape and placement of
individual features.
EvoFIT was also chosen as it can readily produce faces containing just internal
features. Such images are not only a better match to our target pictures (that contain
principally internal features), but represent the region that is most important when
another person recognises a photograph of a familiar face (e.g. Ellis, Shepherd &
Davies, 1979) or names a facial composite (e.g. Frowd, Bruce, McIntyre, et al., 2007;
Frowd, Skelton et al., 2011)—see Figure 2 for examples of this facial region. Indeed,
constructing internal features first (without seeing external features) and then adding
external features afterwards produces much more identifiable faces (M = 46% correct
naming) with EvoFIT than when both internal and external features are constructed
simultaneously (M = 23%) (Frowd, Skelton, Atherton, Pitchford et al., 2012). In a
recent police audit, this approach was shown to be effective, with 60% of EvoFIT
composites directly leading to identification of a suspect, and one in six cases overall
leading to conviction (Frowd, Pitchford et al., 2012).
Method
Participants
Thirty-two (17 female) volunteer students from the University of Central
Lancashire participated in composite construction. Their ages ranged from 18 to
55 years (M = 26.1 years, SD = 10.1 years), and they were recruited on the basis of being
unfamiliar with Australian celebrities. Eight participants were assigned to each of the
four levels of the between-subjects factor, target type (low, medium and high likeness;
prototype).
Materials
Target stimuli for face construction were 24 photographs (eight celebrities
each at low, medium and high likeness) and eight prototypes of Australian celebrities
as selected in Stage 1. Each image was printed to dimensions of approximately 8 cm
(high) by 5 cm (wide), in colour, on single sheets of A4 paper. EvoFIT version 1.5
was used to produce the composites.
Design and Procedure
Participant-constructors were tested individually throughout. They were
randomly assigned to one of the four target types (low, medium and high likeness;
prototype). The same eight celebrities served as targets in each of the four
conditions, with only the type of target image differing between conditions. Each
participant was presented with a
randomly-selected target image and asked whether it was familiar (no one reported
the face was known; had they done so, another image would have been shown).
Participants were instructed to study the face for 60 seconds. Afterwards, they were
told that a composite of this face would be constructed the following day.
Between 20 and 28 hours after the target had been presented, the experimenter
administered a cognitive interview, a standard technique used by forensic
practitioners for obtaining a detailed description of the face (e.g., see Frowd, Nelson
et al., 2012). This involved participant-constructors being asked to visualise the target
face and then freely describe it in their own time and in as much detail as possible,
without guessing; the experimenter also mentioned that he would not interrupt while
this was being carried out but would note down what was said. This face-recall
task has the potential to interfere with participants’ recognition ability, an outcome known as the
verbal-overshadowing effect (Schooler & Engstler-Schooler, 1990), but its use reflects
police practice, and indications are that the size of the effect is
small for composites (e.g. Frowd & Fields, 2010). Next, constructors were given a
brief overview of EvoFIT and the procedure used to construct a composite with this
system. The experimenter operated the software and presented the necessary screens.
The face-construction procedure is described briefly above (introduction to Stage 2)
and in detail in Frowd, Skelton, Atherton, Pitchford et al. (2012, Experiment 3). Each
participant produced a single composite, resulting in a total of 32 composites (8
identities x 4 target types). Face-construction sessions took about an hour to complete
per person.
Stage 3: Composite evaluation (naming)
In this stage, composites were evaluated by asking participants familiar with
the targets to name the composites.
Method
Participants
Seventy-two students (41 female) from the University of New South Wales
volunteered for the composite evaluation task. Their ages ranged from 18 to 57 years
(M = 23.1 years, SD = 9.1 years), and they were recruited on the basis of being generally
familiar with Australian celebrities. Participants were assigned equally to the four
levels of the between-subjects factor, target type.
Materials
Stimuli were 32 individual composites (constructed from 24 photographs and
8 prototype stimuli) and the 32 target pictures used to create these in the previous
stage. Example composites constructed are presented in Figure 2. Four testing sets
were prepared for naming, with each containing composites produced from one
target-image category (low, medium and high likeness; and prototype). Each set
contained eight A4 pages with a single composite printed on each from that condition.
Composites were printed in greyscale, the modality of EvoFIT images, and measured
approximately 8 cm (high) x 6 cm (wide). Target pictures were reproduced in colour,
one image per page, at 8 cm x 7 cm.
Figure 2. Example likenesses produced in the study of the Australian television
personality, Bert Newton. They were created by different people (participant-
constructors) who saw a target image categorised as (a) low likeness, (b) medium
likeness, (c) high likeness and (d) prototype. Image (e) is the prototype target image
presented (in colour) to the participant who constructed composite image (d).
Design and Procedure
In a criminal investigation, the most valuable outcome for a composite is for
someone who is familiar with the face to correctly name it to the police. To simulate
this process, we recruited participants on the basis of being familiar with the relevant
targets and asked them to name the composites. Four testing sets were prepared, each
one containing the eight composites produced from one target type (low, medium and
high likeness; and prototype). As it is possible for carry-over effects to artificially
inflate naming levels when seeing more than one composite of the same identity,
participants inspected composites from one type of target only. They were therefore
randomly assigned to one of these four testing sets in a between-subjects design.
Participants first attempted to name each composite, a task we refer to as
‘spontaneous’ naming. Next, to gauge the extent to which participants were familiar
with the relevant identities, and to check for systematic bias by target type, they then
named the eight photographs that were used to construct their assigned set of
composites (target naming) and indicated which identities they were familiar with
from a list of written celebrity names (name recognition).
Previous studies have indicated that participants find it difficult to name
composites spontaneously in the absence of external features, and produce few correct
names (Frowd, Herold, Duckworth & Hassan, 2012). Naming of the internal
features of composites is more accurate, however, when participants select
identities from a list of written names, a task which we refer to as ‘constrained’
naming. The task has some ecological validity, at least in the sense that a member of
the public (or a police officer) may try to identify a composite through a process of
elimination. Research also indicates that constrained naming is a good proxy to
spontaneous naming of complete composites (Frowd, Bruce, Gannon et al., 2007;
Frowd, Nelson et al., 2012). The task was carried out after name recognition.
The number of participants (evaluators) required for the planned by-
participants naming analysis was estimated using a G*Power analysis (Faul et al.,
2007). This indicated that 72 people in total were needed to achieve a large effect
size (f = .4) with good power (α = .05, Groups = 4, 1-β = .8).
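As a rough check on the inputs to this power analysis, the noncentrality parameter and degrees of freedom for a one-way between-subjects ANOVA can be derived from Cohen's f and the total sample size (λ = f² × N). The sketch below computes only these quantities; the power value itself requires the noncentral F distribution, which G*Power evaluates and which is not reproduced here. The function name is hypothetical.

```python
from typing import Tuple


def anova_noncentrality(f: float, total_n: int, groups: int) -> Tuple[float, int, int]:
    """Return (lambda, df_between, df_error) for a one-way between-subjects ANOVA."""
    if groups < 2 or total_n <= groups:
        raise ValueError("need at least two groups and more participants than groups")
    lam = f ** 2 * total_n  # noncentrality parameter, lambda = f^2 * N
    return lam, groups - 1, total_n - groups


# Values from the text: f = .4, 72 evaluators, 4 target types.
lam, df_between, df_error = anova_noncentrality(0.4, 72, 4)
```

With 72 evaluators in four groups, the degrees of freedom are 3 and 68, which correspond to those of the by-participants F ratios reported in the Results.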
Participant-evaluators were tested individually, and the task was self-paced.
They were told that facial composites of Australian celebrities would be shown for
them to name, or guess if unsure; participants were also told that ‘don’t know’
responses were acceptable. Participants were randomly assigned, with equal
sampling, to one of four testing sets of composites (for low, medium and high
likeness; and prototype). The eight composites from the assigned set were presented
sequentially, and evaluators offered a name for each or a ‘don’t know’ response. The
eight target photographs or prototypes used to construct the composites (from the
assigned set) were then presented sequentially and participants were asked to name
those. Next, participants indicated which identities they were familiar with, using a
written list of the eight celebrities’ names. Finally, the composites were presented
again, in the same order as before, and participants were asked to select the correct
identities from the written list (an eight-alternative forced-choice task). The order of
presentation of composites and target pictures was randomised for each person. No
feedback was given as to the accuracy of responses. Testing sessions lasted for
approximately ten minutes, after which participants were debriefed with the aims of
the experiment.
Results
Participants’ data were checked for missing values (none were
found) and then scored for accuracy for each naming task separately: spontaneous
naming of composites, spontaneous naming of targets, recognition of written target-
names and constrained naming of composites. These data are summarised in Table 1.
Table 1. Performance of each sub-task completed during naming of the composites’
internal-features region. Data are grouped by target type and values are expressed in
percentage correct.

                                                      Target type
Type of task                    Low           Medium        High          Prototype      Mean
Spontaneous composite naming    1.8 (0.7)     1.3 (0.7)     2.6 (0.7)     1.3 (0.5)      1.7 (0.4)
Spontaneous target naming       52.8 (9.9)    58.3 (7.9)    60.4 (9.5)    61.1 (10.5)    58.2 (2.6)
Name recognition                88.9 (4.9)    84.0 (6.4)    87.5 (6.6)    84.7 (6.1)     86.3 (1.1)
Constrained composite naming    15.3 (3.9)    37.5 (7.8)    43.8 (6.8)    44.4 (6.8)     35.2 (6.0)
Note. Figures in parentheses are standard errors of the by-item means.
We initially analysed the effect of target type (low, medium and high likeness;
and prototype) on both spontaneous target naming and name recognition. This
analysis is necessary to ensure that the random allocation of participants to target type
had eliminated group differences in target familiarity—if not, this could influence the
composite-naming analysis.
As can be seen in Table 1, mean values increased somewhat across the
categories of low, medium and high for target naming (second row of data), but they
varied little by condition for name recognition (third row). ANOVA was not
significant for either target naming [by-participants, F1(3,68) = 0.5, p = .71; by-items,
F2(3,21) = 1.5, p = .25] or name recognition [F1(3,68) = 0.4, p = .77; F2(3,21) = 1.2, p
= .32], thus providing no evidence of systematic differences in familiarity between
subject groups. We note that these non-significant results also suggest that it is
unlikely that the different categories of image differed in their respective ‘iconicness’,
which has been shown to promote superior recognition performance (Carbon, 2008).
The second analysis considered naming of composites. Spontaneous naming
scores were calculated by dividing the number of correct responses for each
composite by the number of correct responses for the relevant target picture (Table 1,
first row). As expected, percentage-correct means were low and therefore, due to the
floor effect observed in this task, we did not conduct an inferential analysis.
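The scoring rule described above, in which correct composite names are expressed relative to correct target names, can be sketched as follows. The counts in the example are hypothetical, and returning `None` when no evaluator named the target is our assumption about how an undefined score might be handled.

```python
from typing import Optional


def conditional_naming(composite_correct: int, target_correct: int) -> Optional[float]:
    """Percentage of correct composite names relative to correct target names."""
    if composite_correct < 0 or target_correct < 0:
        raise ValueError("counts must be non-negative")
    if target_correct == 0:
        return None  # undefined when nobody correctly named the target picture
    return 100.0 * composite_correct / target_correct


# Hypothetical counts for one composite: 3 evaluators named the composite
# correctly and 12 evaluators named the corresponding target correctly.
score = conditional_naming(3, 12)
```

Scoring in this way discounts evaluators who did not know the target identity, so that low familiarity does not depress the naming score of a composite.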
Constrained-naming scores for composites were calculated in the same way as
for spontaneous naming of composites (correct responses of composites divided by
correct responses of targets). These data indicated an increase from
low to medium to high categories; in addition, naming scores in the prototype
condition were somewhat higher than in the medium and about the same as in the
high category. ANOVA of these scores was significant for target type [Between-subjects:
F1(3,68) = 10.1, p < .001, ω2 = .28; Within-subjects: F2(3,21) = 6.3, p =
.003, W(5) = 0.39, ω2 = .26], a result that generalizes both by participants and by
items at the same time [minF'(3,49) = 3.9, p = .014] (e.g. Clark, 1973; Raaijmakers,
Schrijnemakers & Gremmen, 1999).
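The minF' statistic combines the by-participants (F1) and by-items (F2) ratios. A minimal sketch of the standard computation (Clark, 1973) is given below: minF' = F1 × F2 / (F1 + F2), with denominator degrees of freedom (F1 + F2)² / (F1² / n2 + F2² / n1), where n1 and n2 are the error degrees of freedom of F1 and F2, respectively. The function name is our own; the input values are those reported above.

```python
from typing import Tuple


def min_f_prime(f1: float, df_err1: float, f2: float, df_err2: float) -> Tuple[float, float]:
    """Return minF' and its denominator degrees of freedom.

    f1, df_err1: by-participants F ratio and its error degrees of freedom.
    f2, df_err2: by-items F ratio and its error degrees of freedom.
    """
    min_f = (f1 * f2) / (f1 + f2)
    df_den = (f1 + f2) ** 2 / (f1 ** 2 / df_err2 + f2 ** 2 / df_err1)
    return min_f, df_den


# Reported values: F1(3,68) = 10.1 and F2(3,21) = 6.3.
min_f, df_den = min_f_prime(10.1, 68, 6.3, 21)
print(f"minF' = {min_f:.1f}, denominator df = {df_den:.1f}")
# prints: minF' = 3.9, denominator df = 49.4
```

These values agree with the reported minF'(3,49) = 3.9, with the denominator degrees of freedom truncated to a whole number.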
Repeated contrasts of the ANOVA indicated that composites were named
significantly more accurately for medium than for low likeness targets [by-participants, t1(34) =
4.4, p < .001, d = 1.4; by-items, t2(7) = 2.5, p = .039, dc = 1.3], but there was no
significant difference between composite naming of medium and high likeness targets
[t1(34) = 1.0, p = .33; t2(7) = 0.8, p = .43]. Composites were also identified
significantly better in the high than the low likeness category [t1(34) = 4.8, p < .001, d
= 1.8; t2(7) = 3.3, p = .014, dc = 1.8]. In addition, we carried out a more sensitive
polynomial trend analysis (in the category order of low, medium and high) to further
explore the relationship between these three variables. This was reliable as a linear
[by-participants, p1 < .001; by-items, p2 = .014] but not a quadratic trend [p1 = .11; p2
= .29], indicating that composites were more recognisable as the similarity match of
the target increased from low to medium to high likeness.
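To illustrate what the linear and quadratic components of a trend analysis measure, the sketch below applies the standard orthogonal polynomial weights for three equally spaced levels to a set of category means. The means are hypothetical and chosen only for illustration; a significance test would also require an error term, which is not shown.

```python
# Standard orthogonal polynomial contrast weights for three ordered levels.
LINEAR = (-1, 0, 1)
QUADRATIC = (1, -2, 1)


def contrast_value(means, weights):
    """Weighted sum of category means for one contrast."""
    return sum(m * w for m, w in zip(means, weights))


# Hypothetical naming means (%) for low, medium and high likeness targets.
hypothetical_means = (10.0, 20.0, 30.0)

linear = contrast_value(hypothetical_means, LINEAR)
quadratic = contrast_value(hypothetical_means, QUADRATIC)
print(linear, quadratic)  # prints "20.0 0.0"
```

A non-zero linear value with a quadratic value near zero corresponds to the pattern reported here: a steady increase across categories with no reliable bend between the medium and high levels.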
A repeated contrast further indicated equivalence between composites based
on prototype and high likeness targets [t1(34) = 0.1, p = .72; t2(7) = 0.2, p = .85]. Two
further contrasts were conducted, with Bonferroni correction applied (α = .05/2 =
.025), which indicated that composites of prototypes were also equivalent to
composites of medium likeness targets [t1(34) = 0.1, p = .92; t2(7) = 1.0, p = .37], but
were superior to composites of low likeness targets [t1(34) = 3.8, p < .001, d = 1.6;
t2(7) = 3.2, p = .014, dc = 1.9].
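The Bonferroni decision rule for these two further contrasts can be sketched as follows. The p values are those reported above, with the by-participants value given as "< .001" entered as its upper bound; requiring reliability both by participants and by items is our assumption about how the two analyses are combined.

```python
def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test alpha after Bonferroni correction."""
    return family_alpha / n_tests


alpha = bonferroni_alpha(0.05, 2)

# (p1 by-participants, p2 by-items) as reported; 0.001 stands in for "< .001".
reported = {
    "prototype vs medium": (0.92, 0.37),
    "prototype vs low": (0.001, 0.014),
}

# Assumption: a contrast counts as reliable only if both p values pass.
decisions = {
    label: (p1 < alpha and p2 < alpha) for label, (p1, p2) in reported.items()
}
for label, reliable in decisions.items():
    print(label, "reliable" if reliable else "not reliable")
```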
Stage 4: Composite evaluation (Similarity ratings)
Modern composite systems (EvoFIT included) generally produce frontal faces
that are evenly lit with a neutral expression. Our target photographs naturally varied
from this standard and included image-specific characteristics that were likely to
distract participant-constructors: such stimuli contain
pictorial properties that, as we have argued, are a source of variability and interfere with
face construction. In contrast, the prototypes contain fewer of these image-specific
characteristics and, as a result, we expected them to be more like the faces produced
from the composite system. To test for this possibility, participants in this stage were
asked to rate the similarity between each target and its corresponding composite for
pictorial properties. Three properties were chosen that are known to influence
unfamiliar-face recognition (see Introduction), and they also appeared to be a major
source of variation in our target set: lighting, head pose and facial expression.
Method
Participants
Eighteen staff and student volunteers (11 females) from the University of
Central Lancashire were recruited for the similarity-rating task. As with participants
who constructed the composites, they were selected on the basis of being unfamiliar
with Australian celebrities. Their ages ranged from 18 to 54 years (M = 27.0 years,
SD = 11.2 years). None had participated in any other phase of the study.
Materials
Thirty-two composites were printed, one per page, alongside the relevant
target photograph or prototype. Target photographs and prototypes were printed in
greyscale, and to the same dimensions as in Stage 3.
Design and Procedure
Each person was presented with either a target picture or a prototype alongside
the corresponding composite and rated similarity on a 10-point scale. Tasks were
blocked by rating scale (lighting, pose and expression) and participants were
presented with all 32 composite-target pairs in each block (block order was
counterbalanced across subjects). The design was thus repeated-measures for rating
scale and target type (low, medium and high likeness; and prototype).
Participant-evaluators were tested individually. They were requested to rate
the accuracy of a set of composites using a 10-point scale (1 = very-poor match / 10 =
very-good match) according to how well a target face and a composite constructed of
it matched in terms of lighting, head pose and facial expression. Composite-target
pairs were presented sequentially, in a different random order for each person, and
evaluators provided a rating score for each pair in their own time. Testing sessions
lasted for about 20 minutes. Participants were debriefed on the experimental aims.
Results
Rating data are summarised in Table 2.
Table 2. Mean matching scores of targets (classified into low, medium and high
likeness; and prototype) and their composites in terms of lighting, head pose and
facial expression. Values range from 1 (very-poor match) to 10 (very-good match).
                                Target type
Rating type    Low         Medium      High        Prototype   Mean
Lighting       3.8 (0.2)   4.3 (0.4)   4.0 (0.2)   5.2 (0.3)   4.3 (0.2)
Pose           4.6 (0.6)   5.5 (0.3)   5.2 (0.6)   6.9 (0.3)   5.5 (0.4)
Expression     3.3 (0.6)   3.9 (0.3)   3.9 (0.6)   4.4 (0.3)   3.9 (0.4)
Mean           3.9 (0.4)   4.5 (0.3)   4.4 (0.4)   5.5 (0.2)   4.6 (0.2)
Note. Figures in parentheses are standard errors of the by-item means.
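As a quick consistency check, the marginal means in Table 2 can be approximately recomputed from the cell means. The sketch below does this from the rounded cell values, so agreement is only expected to within about 0.1 of the reported marginals; the variable names are our own.

```python
# Cell means from Table 2, in the order low, medium, high, prototype.
cells = {
    "Lighting": (3.8, 4.3, 4.0, 5.2),
    "Pose": (4.6, 5.5, 5.2, 6.9),
    "Expression": (3.3, 3.9, 3.9, 4.4),
}
reported_row_means = {"Lighting": 4.3, "Pose": 5.5, "Expression": 3.9}
reported_col_means = (3.9, 4.5, 4.4, 5.5)

# Mean across target types for each rating scale.
row_means = {name: sum(vals) / len(vals) for name, vals in cells.items()}
# Mean across rating scales for each target type.
col_means = tuple(sum(col) / len(col) for col in zip(*cells.values()))

for name, value in row_means.items():
    print(f"{name}: recomputed {value:.2f}, reported {reported_row_means[name]}")
for value, reported in zip(col_means, reported_col_means):
    print(f"column: recomputed {value:.2f}, reported {reported}")
```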
Mean rating scores were analysed using a 3 [Rating (within-subjects):
expression, lighting, pose] x 4 [Target (within-subjects): low, medium, high,
prototype] ANOVA. This analysis was significant for both rating [F1(1.6,27.9) =
22.1, W(2) = 0.7, ε (Huynh-Feldt) = 0.8, p < .001, ηp2 = .57; F2(2,14) = 11.1, W(2) =
0.5, p = .001, ηp2 = .61; minF'(2,28) = 7.4, p = .011] and target type [F1(3,51) = 78.2,
W(5) = 0.5, p < .001, ηp2 = .82; F2(3,21) = 4.1, W(5) = 0.2, p = .020, ηp2 = .37;
minF'(3,23) = 3.9, p = .021]. The interaction was reliable by participants but not
by items [F1(6,102) = 3.7, p = .002, ηp2 = .18; F2(6,42) = 0.7, p = .62].
For target type, repeated contrasts revealed no difference that was reliable both
by participants and by items between low and medium (p1 < .001; p2 = .33) or
between medium and high likeness
targets (p1 = .041; p2 = .67), but (unlike naming) ratings of match were reliably higher
for prototype than for high-likeness targets (p1 < .001, dc = 1.1; p2 = .029, dc = 1.2).
Repeated contrasts indicated that rating of match between composites and
targets was closer for pose than for both lighting [p1 = .001, dc = 1.1; p2 < .001, dc =
1.0] and expression [p1 < .001, dc = 1.5; p2 = .008, dc = 1.8]; ratings between lighting
and expression did not differ significantly [t1(17) = 2.6, p = .017; t2(7) = 1.2, p = .27].
We next assessed the impact on composite naming of target type (photograph,
prototype) and pictorial variation (lighting, pose and expression). This was achieved
using a stepwise linear regression with pictorial variation as predictor variables and
constrained naming as the DV. The backward method was chosen, a technique which
begins with all variables and iteratively removes those without useful contribution
(criteria for removal, p > .1); the method has the benefit of revealing suppressor
variables: variables that are influenced by the presence of other variables. We note
that multicollinearity was not an issue here as pictorial variables were not too highly
correlated with each other (all r < .6). We also included a dichotomous variable that
coded whether the composite’s target was an individual photograph or prototype. The
model achieved a good fit [F(2,31) = 6.9, p = .003, R2 = .32, Durbin-Watson d = 2.1]
with two equally-weighted positive correlations for pose [B = 0.04, SE(B) = 0.02,
r(part) = .30, VIF = 1.3] and for expression [B = 0.05, SE(B) = 0.03, r(part) = .30,
VIF = 1.3]. This suggests that composites were more identifiable when they
were a better match to the target by pose and expression, but not by lighting.
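The reported VIF can be related to the correlation between the two retained predictors. The sketch below assumes that the final model contains only pose and expression, in which case VIF = 1 / (1 - r^2); the function names are our own and the result is approximate because the VIF is reported to one decimal place.

```python
import math


def vif_from_r(r: float) -> float:
    """VIF for a model with exactly two predictors correlated at r."""
    return 1.0 / (1.0 - r ** 2)


def abs_r_from_vif(vif: float) -> float:
    """Absolute correlation between two predictors implied by a given VIF."""
    return math.sqrt(1.0 - 1.0 / vif)


# Reported VIF of 1.3 for both pose and expression.
r = abs_r_from_vif(1.3)
print(f"|r| = {r:.2f}")  # prints "|r| = 0.48"
```

An implied correlation of about .48 is consistent with the statement that all correlations between the pictorial variables were below .6.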
Discussion
We have assessed the potential impact of target representation on face
construction using the EvoFIT face evolving system. Our data provide partial support
for our main experimental hypotheses. It was anticipated that composites based on
the memory of a prototype image would be more recognisable than composites from
any of the three target likeness categories (low, medium or high), but this was only
found to be true for the low likeness category: composite identification was
comparable between prototype and the other two categories. So, the data do not
provide strong support for an advantage of a prototype, which would have required it
to be superior to the three likeness categories. Instead, composites were only superior
for prototypes compared with the lowest likeness category. It was also expected that
composites would be more identifiable from low to medium, and from medium to
high likeness categories. Again, the main analysis indicated a benefit only relative to the
lowest category: there was a reliable increase in composite identification when targets
were of medium relative to low likeness, but there was equivalence between the
medium and high. In a more sensitive test, a polynomial-trend analysis did provide
overall support for successively higher identification from low to medium to high
target likeness categories.
So, the findings provide some evidence to support the notion that memory for
unfamiliar faces includes characteristics of a particular image, although the evidence
is not as strong as was initially anticipated. While there was a reliable difference in
composite identification between low likeness and prototype targets, the strongest
support is from the trend analysis which indicated that higher categorical likeness of
targets led to more identifiable composites. This result does not appear to be a
consequence of the targets themselves being better recognised in the higher likeness
categories, since there was no evidence of
reliable category differences by either spontaneous target naming or name recognition
in Stage 3.
The linear-regression analysis (Stage 4) indicated a strong relationship for
match of composites and targets by both head pose and facial expression. These two
variables produced semipartial correlations that were positive and medium sized,
indicating importance for face production. For EvoFIT, while its faces have a fairly
neutral expression during the evolving stage, constructors had some control over
expression in later holistic-tool use. It is possible therefore for them to add a smile or
a frown, to match their memory of the target’s expression. This enhancement may
have helped to create a more identifiable likeness. In contrast, it is not currently
possible to vary head pose in EvoFIT—all images are rendered in a front-face view.
So, targets with more frontal angle-of-view were constructed more identifiably: those
that required constructors to carry out mental rotation did worse, presumably due to
errors introduced when we process novel views of unfamiliar faces (e.g. Bruce, 1982;
Longmore et al., 2008; Valentine & Bruce, 1988).
This issue is relevant to the real-world application where eyewitnesses may
not have seen an offender’s face front-on, but are required to produce a front-view
composite. Seeing an offender, for instance, through the side window of a car or from
a raised elevation is unlikely to provide a frontal view of the face. It would appear
worthwhile, therefore, for composite systems to be able to render images at a specific
view to match the eyewitness’s memory, an idea for which there is already some
experimental support (Ness, 2003). Note also the sensitivity to head pose: angle of
view for our targets varied within about 30 degrees (in any direction), and yet
this was sufficient to interfere with face construction. For unfamiliar-face
recognition using actual photographs of faces, measurable decrements have likewise been
found with similar changes in head pose (e.g. Longmore et al., 2008). A system
enabling faces to be constructed at a precise angle-of-view does appear to exist, at
least in development (Blanz, Albrecht, Haber & Seidel, 2006). In addition to having
control of angle of view, this particular system is able to cope with a range of
expressions, and so could potentially be used to minimise the effect of exposure-
specific appearance reported here.
The other main implication of our results is not one of direct practical
importance, but instead relates to the methodology used in research and assessments
involving facial composites and their production systems. Given that the likelihood
of a composite being recognised by those familiar with the target depends on the
likeness of the image to which the witness was exposed, it would seem sensible for
forensic researchers to take care when choosing stimuli: they should establish the
likeness of stimuli with people who are familiar with the target identities. This is
particularly important for targets that are perceived as being of low likeness, since
such images produce markedly poorer composites than do
medium and high likeness targets. Indeed, of the 240 male images we collected from
the Internet, around 30% fell into this category. Therefore, given the large
extant literature using image-memory paradigms to evaluate facial-composite systems
(e.g. Frowd, Bruce, McIntyre & Hancock, 2007), we wondered whether the current
experiment might bring the reliability of this literature into question.
In order to address this concern, we carried out a review of studies that have
used a common methodology with EvoFIT (including a nominal 24 hr delay between
target exposure and face construction) but different types of target; these were
either videos (Frowd, Nelson et al., 2012; Frowd, Skelton, Hepton et al., in press) or
photographs (Frowd, Pitchford et al., 2010; Frowd, Skelton, Atherton, Pitchford et al.,
2012). In the equivalent condition in these four studies, mean naming for complete
composites was approximately 25% correct and varied by less than 2%, thus
indicating good consistency by target image mode and good reliability.
While published research on EvoFIT has been considerable to date (see
Frowd, 2012), there is at least some formal research carried out on the other main
commercial evolving system, EFIT-V (e.g., Gibson, Solomon, Maylin & Clark,
2009). The evidence is that this approach behaves similarly to a more traditional
‘feature’ system when tested after a short target delay (Frowd, Carson, Ness,
Richardson et al., 2005; Valentine et al., 2010): laboratory performance following a
more forensically-relevant delay is unknown. However, given that EFIT-V operates
in a broadly similar way to EvoFIT, one would anticipate that the results found here
would generalise to this other system. Our findings, in particular those concerning the
methodological issue of target-image selection, would also appear to apply
to the assessment of feature systems.
To summarize, the results reported here have theoretical value for the study of
face memory in general. Some evidence was found that average
representations (prototypes) are more suitable for the purpose of identification
(Burton et al., 2005; Jenkins et al., 2011), at least in that the prototypes yielded
EvoFIT composites that were more identifiable than composites of low rated likeness
targets. There was also a graded effect (in the trend analysis) such that better likeness images
produced better memory constructions. These results suggest that previous failed
attempts to detect cognitive effects of image likeness may be due to the particular
methodology used (e.g. Johnston & Barry, 2001). Care should be taken to standardize
procedures for image selection when carrying out research using facial composites
and facial-composite systems. The results also indicate that achieving a good match
by target pose and expression (but not lighting) is likely to achieve a more-identifiable
image for practitioners than if these properties are not correctly rendered in the face.
In addition, system designers could usefully enhance their software to allow inclusion
of pictorial information.
References
Blanz, V., Albrecht, I., Haber, J., & Seidel, H.P. (2006). Creating Face Models from
Vague Mental Images. Computer Graphics Forum, 25, 645-654.
Brace, N., Pike, G., & Kemp, R. (2000). Investigating E-FIT using famous faces. In
A. Czerederecka, T. Jaskiewicz-Obydzinska & J. Wojcikiewicz (Eds.). Forensic
Psychology and Law (pp. 272-276). Krakow: Institute of Forensic Research
Publishers.
Bruce, V. (1982). Changing faces: Visual and non-visual coding processes in face
recognition. British Journal of Psychology, 73, 105-116.
Bruce, V., Henderson, Z., Greenwood, K., Hancock, P.J.B., Burton, A.M., & Miller,
P. (1999). Verification of face identities from images captured on video. Journal
of Experimental Psychology: Applied, 5, 339-360.
Bruce, V., Ness, H., Hancock, P.J.B., Newman, C., & Rarity, J. (2002). Four heads
are better than one. Combining face composites yields improvements in face
likeness. Journal of Applied Psychology, 87, 894-902.
Burton, A.M., Jenkins, R., Hancock, P.J.B., & White, D. (2005). Robust
representations for face recognition: The power of averages. Cognitive
Psychology, 51, 256-284.
Burton, A., Wilson, S., Cowan, M., & Bruce, V. (1999). Face recognition in poor-
quality video: evidence from security surveillance. Psychological Science, 10,
243–248.
Carbon, C.C. (2008). Famous faces as icons: the illusion of being an expert in the
recognition of famous faces. Perception, 37, 801-806.
Clark, H.H. (1973). The language-as-fixed-effect fallacy: A critique of language
statistics in psychological research. Journal of Verbal Learning and Verbal
Behavior, 12, 335-359.
Davies, G.M., & Milne, A. (1982). Recognizing faces in and out of context. Current
Psychological Research, 2, 235-246.
Davies, G.M., van der Willik, P., & Morrison, L.J. (2000). Facial Composite
Production: A Comparison of Mechanical and Computer-Driven Systems.
Journal of Applied Psychology, 85, 119-124.
Dunlap, W. P., Cortina, J. M., Vaslow, J. B., & Burke, M. J. (1996). Meta-analysis of
experiments with matched groups or repeated measures designs. Psychological
Methods, 1, 170-177.
Faul, F., Erdfelder, E., Lang, A.G., & Buchner, A. (2007). G*Power 3: A flexible
statistical power analysis program for the social, behavioural, and biomedical
Sciences. Behavior Research Methods, 39, 175-191.
Frowd, C.D. (2012). Facial Recall and Computer Composites. In C. Wilkinson and C.
Rynn (Eds). Facial Identification (pp. 42 – 56). Cambridge University Press:
New York.
Frowd, C.D., Bruce, V., Gannon, C., Robinson, M., Tredoux, C., Park, J., McIntyre,
A., & Hancock, P.J.B. (2007). Evolving the face of a criminal: how to search a
face space more effectively. In A. Stoica, T. Arslan, D. Howard, T. Kim and A.
El-Rayis (Eds.) 2007 ECSIS Symposium on Bio-inspired, Learning, and
Intelligent Systems for Security, (pp. 3-10). NJ: CPS. (Edinburgh).
Frowd, C.D., Bruce, V., McIntyre, A., & Hancock, P.J.B. (2007). The relative
importance of external and internal features of facial composites. British Journal
of Psychology, 98, 61-77.
Frowd, C.D., Carson, D., Ness, H., McQuiston, D., Richardson, J., Baldwin, H., &
Hancock, P.J.B. (2005). Contemporary Composite Techniques: the impact of a
forensically-relevant target delay. Legal & Criminological Psychology, 10, 63-81.
Frowd, C.D., Carson, D., Ness, H., Richardson, J., Morrison, L., McLanaghan, S., &
Hancock, P.J.B. (2005). A forensically valid comparison of facial composite
systems. Psychology, Crime & Law, 11, 33-52.
Frowd, C.D., & Fields, S. (2010). Verbal overshadowing interference with facial
composite production. Psychology, Crime and Law, 17, 731-744.
Frowd, C.D., Hancock, P.J.B., & Carson, D. (2004). EvoFIT: A holistic, evolutionary
facial imaging technique for creating composites. ACM Transactions on Applied
Perception (TAP), 1, 1-21.
Frowd, C.D., Herold, K., Duckworth, L., & Hassan, A. (2012). The impact of hair for
the construction and recognition of facial-composite images. Manuscript under
revision.
Frowd, C.D., Nelson, L., Skelton, F.C., Noyce, R., Atkins, R., Heard, P., Morgan, D.,
Fields, S., Henry, J., McIntyre, A., & Hancock, P.J.B. (2012). Interviewing
techniques for Darwinian facial composite systems. Applied Cognitive
Psychology, DOI: 10.1002/acp.2829.
Frowd, C.D., Pitchford, M., Bruce, V., Jackson, S., Hepton, G., Greenall, M.,
McIntyre, A., & Hancock, P.J.B. (2010). The psychology of face construction:
giving evolution a helping hand. Applied Cognitive Psychology. DOI:
10.1002/acp.1662.
Frowd, C.D., Pitchford, M., Skelton, F., Petkovic, A., Prosser, C., & Coates, B.
(2012). Catching Even More Offenders with EvoFIT Facial Composites. In A.
Stoica, D. Zarzhitsky, G. Howells, C. Frowd, K. McDonald-Maier, A. Erdogan,
and T. Arslan (Eds.) IEEE Proceedings of 2012 Third International Conference
on Emerging Security Technologies, DOI 10.1109/EST.2012.26 (pp. 20 - 26).
Frowd, C.D., Skelton, F., Atherton, C., Pitchford, M., Hepton, G., Holden, L.,
McIntyre, A., & Hancock, P.J.B. (2012). Recovering faces from memory: the
distracting influence of external facial features. Journal of Experimental
Psychology: Applied, 18, 224-238.
Frowd, C.D., Skelton, F., Atherton, C., & Hancock, P.J.B. (2012). Evolving an
identifiable face of a criminal. The Psychologist, 25, 116 – 119.
Frowd, C.D., Skelton, F., Butt, N., Hassan, A., & Fields, S. (2011). Familiarity effects
in the construction of facial-composite images using modern software systems.
Ergonomics, DOI: 10.1037/a0027393.
Frowd, C.D., Skelton, F., Hepton, G., Holden, L., Minahil, S., Pitchford, M.,
McIntyre, A., Brown, C., & Hancock, P.J.B. (in press). Whole-face procedures
for recovering facial images from memory. Science and Justice.
Gibson, S.J., Solomon, C.J., Maylin, M.I.S., & Clark, C. (2009). New methodology in
facial composite construction: from theory to practice. International Journal of
Electronic Security and Digital Forensics, 2, 156-168.
Hill, H., & Bruce, V. (1996). Effects of lighting on the perception of facial surfaces.
Journal of Experimental Psychology: Human Perception and Performance, 22,
986-1004.
Jenkins, R., & Burton, A. M. (2008). 100% accuracy in automatic face recognition.
Science, 319, 435.
Jenkins, R., White, D., Van Montfort, X., & Burton, A. M. (2011). Variability in photos
of the same face. Cognition, 121, 313-323.
Johnston, R. A. & Barry, C. (2001). Best face forward: Similarity effects in repetition
priming of face recognition. The Quarterly Journal of Experimental Psychology
Section A, 54, 383-396.
Liu, C. H., & Ward, J. (2006). Face recognition in pictures is affected by perspective
transformation but not by the centre of projection. Perception, 35, 1637-1650.
Longmore, C.A., Liu, C.H. & Young, A.W. (2008). Learning Faces From
Photographs. Journal of Experimental Psychology: Human Perception and
Performance, 34, 77–100.
Ness, H. (2003). Improving facial composites produced by eyewitnesses. Unpublished
Ph.D. thesis, University of Stirling.
Raaijmakers, J.G., Schrijnemakers, J.M., & Gremmen, F. (1999). How to deal with
“the language-as-fixed-effect fallacy”: common misconceptions and alternative
solutions. Journal of Memory and Language, 41, 416–426.
Schooler, J.W., & Engstler-Schooler, T.Y. (1990). Verbal overshadowing of visual
memories: some things are better left unsaid. Cognitive Psychology, 22, 36-71.
Valentine, T., & Bruce, V. (1988). Mental rotation of faces. Memory & Cognition, 16,
556-566.
Valentine, T., Davis, J. P., Thorner, K., Solomon, C., & Gibson, S. (2010). Evolving
and combining facial composites: Between-witness and within-witness morphs
compared. Journal of Experimental Psychology: Applied, 16, 72 – 86.
Appendix. Australian celebrities used in the study.
Kim Beazley
Hamish Blake
Jamie Durie
Grant Hackett
Eddie McGuire
Brendan Nelson
Bert Newton
Guy Sebastian