Technological Forecasting & Social Change 72 (2005) 377–399
Judgment change during Delphi-like procedures:
The role of majority influence, expertise, and confidence
Gene Rowe a,*, George Wright b, Andy McColl c
a Institute of Food Research, Norwich Research Park, Norwich NR4 7UA, UK
b Strathclyde Graduate Business School, University of Strathclyde, 199 Cathedral Street, Glasgow G4 OQU, UK
c Durham Business School, University of Durham, Mill Hill Lane, Durham DH1 3LB, UK
Received 30 January 2004; received in revised form 24 March 2004; accepted 29 March 2004
Abstract
This study investigates individual opinion change and judgmental accuracy in Delphi-like groups. Results
reveal that the accuracy of judgmental probability forecasts increases over Delphi rounds (in terms of proportion
correct and appropriateness of confidence) when statistical summaries or written rationales are provided from other
members of an individual’s nominal group, but does not increase in a control iteration condition (without
feedback). Additionally, subjects who gave more appropriate probability forecasts on the first round exhibited least
opinion change, although measures of confidence were unrelated to opinion change. Results also show that
majority opinion exerts strong opinion pull on minority opinion even when the majority favours an incorrect
answer (irrespective of the nature of feedback provided). The implications of these results for the utility and
conduct of the Delphi technique are discussed, in particular, with respect to selecting panellists and choosing an
appropriate feedback format.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Delphi; Forecasting; Opinion change; Majority influence
1. Introduction: the Delphi technique
The Delphi technique is a forecasting tool that was developed to allow the benefits of canvassing
multiple judges without the often-corresponding deficits associated with group interaction that may arise
from social processes [1–3]. It is a structured group process, in which individuals are required to give
0040-1625/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.techfore.2004.03.004
* Corresponding author. Tel.: +44-1603-255-125.
E-mail addresses: [email protected] (G. Rowe), [email protected] (G. Wright), [email protected]
(A. McColl).
numerical judgments or forecasts over a number of rounds, with feedback being provided from the
anonymous other members of the panel, and the final aggregate being taken as the process output. It is
not, however, a method intended to force consensus per se—response stability rather than consensus is
the signal to cease additional polling, with disagreement (as indicated by, for example, a bipolar
distribution of responses) accepted as informative.
Delphi’s effectiveness over comparative procedures, at least in terms of judgmental accuracy, has
generally been demonstrated [3]. In a review of empirical studies of Delphi, Rowe and Wright [4] found
that Delphi groups outperformed ‘statistical’ groups (which involve the aggregation of the judgments of
noninteracting individuals) in 12 studies, underperformed these in two, and ‘tied’ in two others, while
Delphi outperformed standard interacting groups in five studies, underperformed in one, and ‘tied’ in
two. This trend is all the more impressive given that many laboratory studies of Delphi effectiveness
have used simplified versions of the technique (e.g., with limited feedback) in simplified contexts (e.g.,
using nonexpert, student subjects) that might be anticipated to undermine the virtues of the technique
[4,5]. We return to this issue shortly.
Although research suggests that Delphi allows improved judgment compared to alternative methods,
as demonstrated in these ‘technique comparison’ studies, the reasons for this are still unclear, given a
relative dearth of ‘process’ studies that have attempted to establish the precise mechanism for
improvement in Delphi [5]. In this study, we attempt to advance understanding of how Delphi improves
judgmental performance.
2. Research into the Delphi process: variables and measures
Generally, it is assumed that Delphi ‘works’ (improves judgmental accuracy) because of the feedback
provided between rounds in conjunction with its anonymity; that is, it is assumed that the lack of cues as
to the identity of the sources of the feedback allows participants to focus upon the content of that
feedback rather than being distracted by extraneous social information, so being influenced by ‘good’
information and uninfluenced by apparently ‘poor’ information. Indeed, the power of feedback was
demonstrated by Scheibe et al. [6], who gave ‘false’ feedback to panellists and found that their estimates
still converged towards this fallacious figure.
However, the extent of influence of feedback is uncertain, in the sense that the variable is confounded
in Delphi studies by the iteration variable. After all, it is possible that Delphi panellists may ignore
feedback and still improve, by using the opportunity afforded by each new round to review their own
previous estimate. Indeed, Parente et al. [7] attempted to separate out the effects of iteration and
feedback and found that, although neither iterated polling nor consensus feedback had a discernible
effect upon ‘if’ accuracy (forecasting if an event would occur), a condition involving iteration alone
resulted in improved accuracy for ‘when’ a newsworthy event would occur while a purely feedback
condition (no iteration involved) did not. Boje and Murnighan [8] also found that accuracy decreased
over rounds in a standard Delphi procedure yet improved in a purely iterative condition. These results
seem to suggest that it is iteration that is the more powerful influence leading to improved accuracy, and
not feedback.
Rowe and Wright [9] have argued, however, that the feedback used in previous studies has been
somewhat superficial. For example, the feedback in Parente et al. [7] simply comprised modes and
medians, while that from Boje and Murnighan [8] comprised estimates and their justifications, without
average figures. They argued that better quality, more informative feedback is liable to be more
(positively) influential. Indeed, Best [10] found that for one of two task items, a Delphi group that was
given ‘Reasons’ feedback in addition to a median and range of estimates was more accurate than a
Delphi group that was provided with feedback that excluded reasons. Rowe and Wright [9] themselves
compared three feedback conditions: an ‘Iteration’ condition (involving iteration over rounds without
feedback from the members of the Delphi panel), a ‘Statistical’ feedback condition (involving median
values and range of estimates), and a ‘Reasons’ feedback condition (involving reasons from the Delphi
panellists along with their numerical estimates). They found that, although subjects were less inclined to
change their forecasts as a result of receiving Reasons feedback than other types, when they did change
forecasts, this change tended to be for the better, leading to a reduction in error. Although subjects tended
to make greater changes to their forecasts in the Iteration and Statistical conditions than in the Reasons
condition, these changes did not, in general, improve predictions.
From these studies, it may be suggested that Delphi ‘works’ partially because of the influence of
iteration, allowing panellists to reflect on previous answers, and partially because of the influence of
feedback, which is most likely to be successful when more rather than less informative. The output from
a Delphi process, however, is an aggregate of the individual judgments of the panellists, and this research
says little about who among the panellists is affected by what aspects of the technique and in what way.
Understanding the mechanics of judgment change in Delphi is likely to have implications, not only for
the best structure of Delphi (e.g., best way to provide feedback), but also for the selection of panellists.
One explanation for how Delphi works that considers the nature of panellists is the ‘Theory of Errors’.
Parente and Anderson-Parente [11] interpret this as follows: Delphi accuracy improves over rounds,
because the panellists comprise essentially two types of people: the ‘holdouts’, who do not change their
opinion (much) over rounds, and the ‘swingers’, who do. If the holdouts are the more accurate (expert)
panellists, then it can be mathematically shown that if the swingers alter their judgments towards the
average value, the group judgment will subsequently improve. Certainly, research from outside the
Delphi domain has indicated that expertise does have an impact on performance within interacting
groups (e.g., Ref. [12]) and that expertise interacts with feedback provision in group and group decision
support system situations (Prietula et al. [13] showed feedback to be useful for experts solving an
engineering problem but to have little effect for ‘intermediate’ experts and none for ‘novices’).
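A small numerical sketch may help to make this reading of the Theory of Errors concrete. All figures are hypothetical (a true value of 100, two accurate holdouts, two inaccurate swingers), and the function name `shift_towards_mean` and the shift fraction of 0.5 are illustrative assumptions rather than values taken from any of the studies cited.

```python
from statistics import mean

def shift_towards_mean(estimates, is_swinger, fraction):
    """Move each swinger's estimate a given fraction of the way towards
    the first-round group mean; holdouts keep their estimates."""
    group_mean = mean(estimates)
    return [
        est + fraction * (group_mean - est) if swinger else est
        for est, swinger in zip(estimates, is_swinger)
    ]

# Hypothetical panel: the true value is 100; the two holdouts are accurate,
# the two swingers are not.
true_value = 100.0
round1 = [100.0, 100.0, 60.0, 160.0]
is_swinger = [False, False, True, True]

round2 = shift_towards_mean(round1, is_swinger, fraction=0.5)

# Error of the group aggregate (the mean) on each round.
error_round1 = abs(mean(round1) - true_value)
error_round2 = abs(mean(round2) - true_value)
print(round2)
print(error_round1, error_round2)
```

In this example the group mean moves towards the holdouts' position, so the aggregate improves only because the holdouts happen to be the accurate panellists; were the holdouts the less accurate members, the same mechanism could move the aggregate away from the true value.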
Rowe and Wright [9] produced some empirical support for this theory, finding that people who were,
in general, the best forecasters on the first round of a Delphi, were the least prepared to change judgment
in the face of feedback, while the poorest predictors were the most prepared to change. However, this is
just one piece of evidence for the theory. Does this process take place in all/most Delphis? If not, are
there other explanatory mechanisms for accuracy change that are as good or better? Intuitively, for
example, an equally (or perhaps more) compelling theory is that the holdouts are the most confident,
rather than the most knowledgeable, and as such, Delphi will work when confidence is appropriate but
will not when it is inappropriate (and might be as related to a personality trait as to knowledgeability). A
much greater understanding of the mechanism of judgment change seems warranted.
3. The present study
The experiment reported here attempts to build upon earlier findings through replication (by utilising
three conditions similar to the Rowe and Wright study [9]) and by further exploring the mechanisms
behind the influence of feedback. With respect to the latter aim, it is possible that panellists receiving
Reasons feedback were more discriminative than those receiving Statistical feedback. One possible
explanation for the relative lack of appropriate change in the Statistical condition may be that panellists
were simply following majority opinion (see below), inasmuch as they lacked the qualitative information
in the Reasons feedback to enable them to reject the feedback when it was ill-informed.
One issue alluded to, but not explicitly considered, in Rowe and Wright’s study was the impact of
degree of support or opposition on panellists’ individual judgments in Delphi-like groups. It would
seem likely that this factor is an important determinant of the propensity of an individual to change
his/her judgments over rounds and, relatedly, that it may influence the final round accuracy of those
judgments. In interacting groups, the compelling influence of majorities on individuals has been much
studied (e.g., Refs. [14–16]) as has, more recently, the significant influence that minorities can
occasionally exert on majorities (e.g., Refs. [17,18]). In the case of nominal group techniques, such as
Delphi, however, a key aim is to reduce the conformity pressures often associated with majority
influence (such as via panellist anonymity) that might otherwise compel a consensus (although
whether Delphi-like techniques have succeeded in this aim is a matter of debate, e.g., Refs. [4,19]).
One possibility, which might partially account for the results of Rowe and Wright [9], is that different
types of nominal group feedback have differential utility with regards to encouraging panellists to
resist inappropriate majority influences and to accept appropriate minority ones. The present study
considers this issue.
In Rowe and Wright’s study, ‘objective expertise’ (as defined by the actual accuracy of panellists,
determined posttask) was related to the propensity of subjects to change their estimates in the two
Delphi-like conditions. This is to say that ‘expertise’ is derived objectively by post hoc analysis, rather
than by attempting to determine the expertise of participants a priori. Specifically and in line with
Parente and Anderson-Parente’s [11] interpretation of the Theory of Errors, subjects who were more
‘expert’ exhibited less change over rounds than those who were, on average, less ‘expert’ (see also Ref.
[20]). In the present study, the influences of this factor will once more be considered.
Additionally, the present study will analyse the effect of subjects’ confidence in their forecasts on
opinion change and subsequent accuracy. ‘Confidence’ has been a much-studied phenomenon. One
reason is that, in the absence of clear means for identifying the objective accuracy of judgments and
choices prior to a task, the self-evaluations of individuals or groups may prove the only measurable
determinant of effectiveness [20,21]. However, for confidence to be accepted as an appropriate
performance measure, it must be demonstrably related to objective performance quality. Little research
has been done on the appropriateness of confidence of individuals in noninteracting, nominal groups,
although Armstrong [22] and Sniezek [20] have suggested that confidence may be positively
correlated with individual accuracy. Rowe and Wright [9] examined whether initial individual
confidence was a relevant predictor of individuals’ accuracy and propensity to change judgments in
such procedures. Results indicated that average individual confidence exhibited no significant
relationship to individuals’ propensity to change judgments in any of their three conditions (recall
that these were the Iteration, Statistical, and Reasons conditions). At a different level of analysis,
however, results suggested that subject confidence might be broadly appropriate, given that average
second-round confidence was significantly higher than first-round confidence (in all conditions) and
that this corresponded to an increase in second-round accuracy—a result that would appear to justify
further investigation of the relationship between changes in confidence and accuracy within a nominal
group context.
Structured group techniques, such as Delphi, are primarily concerned with inducing change in their
individual members in the hope of achieving greater collective accuracy through the aggregation of
individual estimates. In the current study, subjects were required to make judgments regarding the
occurrence or nonoccurrence of events in the near future and indicate how confident they were in their
forecasts, with objective accuracy subsequently determined. This procedure allows overall accuracy to
be measured in terms of proportion correct, accuracy improvement to be measured in terms of the
increase or decrease in proportion correct, and change to be measured in terms of the proportion of items
for which a different forecast is made on the second round to that made on the first. Hence, our main
performance/response measures in the present study are concerned with (1) judgment change over
rounds (in terms of magnitude not direction) and (2) accuracy improvement over rounds (the
appropriateness of judgment change).
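The three measures just described can be sketched as follows. The data are hypothetical (one subject, five items), and the function names are illustrative assumptions; the sketch is only meant to show that judgment change and accuracy improvement are computed separately.

```python
def proportion_correct(forecasts, outcomes):
    """Share of items on which the forecast ('a' or 'b') matched the outcome."""
    hits = sum(f == o for f, o in zip(forecasts, outcomes))
    return hits / len(outcomes)

def proportion_changed(round1, round2):
    """Share of items on which the second-round forecast differs from the first."""
    changes = sum(f1 != f2 for f1, f2 in zip(round1, round2))
    return changes / len(round1)

# Hypothetical subject answering five items ('a' = will occur, 'b' = will not).
outcomes = ["a", "b", "b", "a", "b"]
round1 = ["a", "a", "b", "b", "b"]
round2 = ["a", "b", "b", "b", "a"]

accuracy_r1 = proportion_correct(round1, outcomes)
accuracy_r2 = proportion_correct(round2, outcomes)
improvement = accuracy_r2 - accuracy_r1      # accuracy improvement over rounds
change = proportion_changed(round1, round2)  # judgment change over rounds
print(accuracy_r1, accuracy_r2, improvement, change)
```

Here the subject changes two of five forecasts, yet proportion correct is the same on both rounds: one change corrects an error and the other introduces one. This is why change and the appropriateness of change are treated as distinct measures.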
Additionally, performance quality is measurable in terms of the appropriateness of subjects’
confidence in their judgments. ‘Calibration’ is one measure often used to consider confidence
appropriateness (e.g., Refs. [23,24]) and is demonstrated in the following example. Consider a subject
who gives a confidence value of 70% (or .7) for 10 different forecasts (out of a larger set of forecasts): if
7 of these 10 forecasts prove correct, then that individual is said to be perfectly calibrated; if less than 7
are correct, then she/he is said to be ‘overconfident’; and if more than 7 are correct, then she/he is
deemed to be ‘underconfident’. Likewise, if that subject gave a confidence value of 60% for 10 other
forecasts (from the larger set), then they would be perfectly calibrated if 6 of these were correct (or if 12
of 20 forecasts were correct, etc.), overconfident if less than 6 were correct, and underconfident if more
than 6 were correct.
The concept of calibration is often best expressed and is easiest to understand through the use of
calibration graphs. More precisely, graphs can be drawn to express the relationship between subjects’
mean probability responses and their subsequent forecast accuracy, as measured by mean proportion
correct. Perfect calibration means a perfect correspondence between these measures and is manifest in a
calibration graph by the diagonal through the origin. ‘Overconfidence’ occurs when subjects’ probability
responses are greater than proportion correct for the corresponding events. If this is the case, then the
calibration curve falls below the diagonal when mean probability response is plotted on the x axis and
proportion correct on the y axis. Conversely, underconfidence is indicated by a calibration curve lying
above the diagonal, which indicates that, for a particular level of probability response, subjects correctly
predicted the occurrence of events more frequently than they expected.
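The points of such a calibration graph, and their position relative to the diagonal, can be sketched as below. The data are hypothetical and confidence is expressed as a probability (0.5 to 1.0). Grouping forecasts by their exact stated confidence value is a simplifying assumption; an analysis with many distinct values would typically group them into ranges first.

```python
from collections import defaultdict

def calibration_points(confidences, correct):
    """Group forecasts by stated confidence and return
    {confidence: (proportion correct, number of forecasts)}."""
    groups = defaultdict(list)
    for conf, hit in zip(confidences, correct):
        groups[conf].append(hit)
    return {c: (sum(hits) / len(hits), len(hits))
            for c, hits in sorted(groups.items())}

def classify(confidence, prop_correct):
    """Position of a calibration point relative to the diagonal."""
    if prop_correct < confidence:
        return "overconfident"    # point falls below the diagonal
    if prop_correct > confidence:
        return "underconfident"   # point lies above the diagonal
    return "perfectly calibrated"

# Hypothetical subject: ten forecasts at 0.7 (7 correct), ten at 0.6
# (4 correct), and five at 0.9 (all correct).
confidences = [0.7] * 10 + [0.6] * 10 + [0.9] * 5
correct = [1] * 7 + [0] * 3 + [1] * 4 + [0] * 6 + [1] * 5

points = calibration_points(confidences, correct)
labels = {c: classify(c, p) for c, (p, n) in points.items()}
print(points)
print(labels)
```

Plotting each confidence value on the x axis against its proportion correct on the y axis gives the calibration curve; the 0.7 point in this example sits on the diagonal, the 0.6 point below it, and the 0.9 point above it.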
Although a formula exists for the calculation of calibration per se, which indicates the extent of
individual over- or underconfidence, the main interest of the present study is in the global level of
confidence appropriateness and its relationship to other measures, rather than the precise and
potentially complex nature of miscalibration. A more comprehensive measure of appropriateness
of confidence is the mean probability score, or Brier score [25]. It is perhaps the most common measure of
the external correspondence of confidence beliefs and is generally accepted as being easier to
interpret than the outcome of the calibration formula. Furthermore, the Brier score encompasses
calibration, with the numerical measure of calibration being a subcomponent of the overall Brier
score (see, for example, Ref. [26], for discussion of the differences between, and advantages of, the
two measures).
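For orientation, the sketch below computes a Brier score in its commonly used form for two-option forecasts: the mean squared difference between the probability assigned to the chosen option and an outcome index of 1 (correct) or 0 (incorrect). This form is an assumption made for illustration; the study's own formulation is the one stated in its Results section. The data are hypothetical.

```python
def brier_score(probabilities, correct):
    """Mean squared difference between the probability assigned to the
    chosen option and the outcome index (1 if correct, 0 otherwise).
    0 is the best possible score; lower is better."""
    squared = [(p - d) ** 2 for p, d in zip(probabilities, correct)]
    return sum(squared) / len(squared)

# Hypothetical half-range responses (0.5 to 1.0) for four forecasts.
probabilities = [1.0, 0.5, 0.5, 1.0]
correct = [1, 1, 0, 0]
score = brier_score(probabilities, correct)

# A subject who always answers 0.5 scores 0.25 whatever the outcomes.
indifferent = brier_score([0.5] * 4, [1, 0, 1, 0])
print(score, indifferent)
```

Under this form, a confident error is penalised far more heavily than an unsure one, which is why the score reflects the appropriateness of confidence and not proportion correct alone.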
Therefore, in the following analysis, the appropriateness of confidence will be assessed in two ways:
graphically, through the use of calibration graphs (to show, visually, how confidence appropriateness
varies across conditions and rounds), and statistically, through analysis of Brier scores. A precise
formulation of the latter measure is given in the Results section. The key issue underlying the
consideration of these additional performance measures is that, if self-reports of confidence are generally
shown to be appropriate, then the case for using confidence as a surrogate for objective expertise is
strengthened.
As in the Rowe and Wright study, the hypotheses below relate to the expected main effects of each of
the external (feedback, majority–minority influence) and internal (objective expertise, confidence)
variables in relation to the two key measures of change and accuracy. The additional performance
measures noted above are discussed in Hypothesis 3 below (i.e., related to the confidence factor), rather
than used to form distinct hypotheses in each of the internal/external factor sections. The policy of
treating the confidence appropriateness measures separately, in a single section, is made to add clarity
and structure to the subsequent analysis.
4. Hypotheses
4.1. Effects of feedback
(A) In line with the findings of Rowe and Wright’s study [9], it is predicted that accuracy will improve
from the first to the second round across all three conditions (Iteration, Statistical, and Reasons) in terms
of the accuracy measure (proportion correct). In accordance with the earlier findings, no differences in
the degree of accuracy improvement, between conditions, are explicitly predicted.
(B) It is predicted that the degree of judgment change over rounds—in terms of the selection of
converse predictions—will be highest for the Iteration condition, next highest for the Statistical
condition, and lowest for the Reasons condition; that is, participants who select a prediction on the
first round of Delphi polling are most likely to amend this prediction in the light of ‘feedback’ in the
Iteration condition and are least likely to amend initial predictions in the Reasons condition. This
hypothesis is in correspondence with the findings of Rowe and Wright, in which the, perhaps,
counterintuitive results (given that more profound feedback might be expected to be more influential
in encouraging judgment change, rather than less) were explained in terms of the differing capacity
of feedback types to inform subjects of the expertise of their group peers; that is, as feedback
becomes more specific and less ambiguous, it may provide more information about the quality of the
others in the subject’s group and, in particular, allow identification of poor or specious arguments
that might otherwise be disguised by a simple statistic. Hence, with Reasons feedback, it is expected
that subjects will change their forecasts less, but when they do change, they should do so on more
appropriate occasions (i.e., generally leading to more accurate predictions over rounds, as per
prediction 1A).
An important caveat to the above prediction (as in the previous study) is that the subjects under each
feedback condition should face, on average, roughly the same degree of opinion pull against their initial
positions: if differences exist across conditions due to the samples of subjects being in relatively greater
agreement or disagreement, then this is liable to mask or distort any differential influence of the feedback
that derives simply from ‘type’. Although a procedure of randomly allocating subjects to conditions
should control for such outcomes, the potential importance of this factor on judgment change over
rounds is such that the degree of opinion pull per condition will be addressed analytically in the Results
section.
4.2. Objective expertise
Consistent with the Theory of Errors and Ref. [9], it is hypothesised that the greater the objective
expertise of subjects (reflected in a higher proportion of correct first-round predictions), the less their
judgment changes over rounds will be (in terms of the selection of converse options). Inasmuch as only
two of the conditions meet the broad requirements of Delphi (involving some form of feedback), this
prediction is made for the Statistical and Reasons conditions alone; no specific hypothesis is made
concerning the Iteration condition.
4.3. Confidence
Sniezek [21] has suggested that confidence may be an appropriate surrogate predictor of objective
expertise, although results from the Rowe and Wright [9] study were equivocal on this issue. By using
the Brier score, however, the appropriateness of confidence can be calculated.
It is thus predicted that (A) high first-round mean confidence will be related to high first-round
accuracy. With respect to judgment change over rounds, it is predicted that (B) high first-round
confidence will be related to low propensity to make forecast changes over rounds (in terms of changing
predictions to the converse alternative).
Calibration graphs and Brier scores will be used to assess individuals’ appropriateness of confidence
and, in particular, how this changes over rounds in correspondence with the different feedback
conditions. Because hypotheses on these issues are neither clear nor motivated by past work, none will
be explicitly stated here, although it might be anticipated that subjects’ confidence appropriateness will
improve over rounds as a result of the provision of feedback and that the respective degree of
improvement will vary across conditions.
4.4. Majority–minority influence
As previously discussed, one of the main reasons for the use of structured groups is the reduction
of the kind of social pressures that are often observed in interacting groups and which may
adversely affect the quality of judgment and performance. Although majorities might still be
expected to exert a degree of pressure for judgment change (over rounds) in structured/nominal
groups, it is to be anticipated that change will be more informed than in interacting groups due to
the lesser immediacy of the anonymous others. Consequently, it is predicted that (A) the more
opponents there are to an individual’s position, the more likely she/he will be to change to the
converse prediction on the second round. However, in anticipation of the more discriminative use of
the feedback information, it is predicted that (B) when the majority of nominal group members are
‘in the right’ (with respect to their first-round prediction), the proportion of correct predictions will
increase on the second round due to the informational and normative influences on the minorities;
but when the majority are ‘in the wrong’, proportion correct will also increase due to the ability of
the minorities to resist majority pressures, and indeed, to pull the incorrect majorities towards their
own positions. This latter hypothesis relates to the Theory of Errors, in that the expectation is that those
individuals who are relatively more expert will ‘hold out’ rather than ‘swing’ and will subsequently pull
the less expert towards their position, so enhancing accuracy, even in cases when the majority view is
against them.
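The two quantities on which these predictions turn, the number of opponents to an individual's position and whether the group majority is correct, can be sketched for a single item as follows. The group data and function names are hypothetical, and the tie rule (no majority when votes are equal) is an assumption made for illustration.

```python
def opposition(own_choice, others_choices):
    """Number of other nominal-group members whose first-round choice
    differs from the subject's own choice on one item."""
    return sum(choice != own_choice for choice in others_choices)

def majority_choice(choices):
    """Option chosen by more than half of the group, or None on a tie."""
    a_votes = choices.count("a")
    b_votes = len(choices) - a_votes
    if a_votes == b_votes:
        return None
    return "a" if a_votes > b_votes else "b"

# Hypothetical five-member group on one item whose outcome was 'b'.
group = ["a", "a", "a", "b", "a"]
outcome = "b"

# The fourth member holds the (correct) minority position.
opponents_of_member_4 = opposition(group[3], group[:3] + group[4:])
majority = majority_choice(group)
majority_is_correct = majority == outcome
print(opponents_of_member_4, majority, majority_is_correct)
```

In this example the fourth member faces four opponents and an incorrect majority, which is the situation in which prediction (B) expects the minority to hold out.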
It is expected that the above predicted main effect will be moderated by interaction with other
variables, namely, feedback type, objective expertise, and (original) level of confidence, but, as these
interactions are liable to be complex, no predictions are made about these here. It should also be noted
that because the above hypotheses rely on feedback of the opinions of others, they relate to the Statistical
and Reasons feedback conditions alone.
5. Method
5.1. Subjects
Ninety undergraduate students from the University of the West of England (from a variety of
courses) were recruited to complete two questionnaires over the course of 4 days, with each
questionnaire taking approximately 45 min to complete. Subjects were randomly allocated to 18
nominal five-member groups. Thus, six groups were allocated to each of the three feedback conditions.
Four recruited subjects did not arrive for the first questionnaire, such that two of the conditions
(namely, Iteration and Reasons) had only 28 allocated individuals, who were composed into four
groups of five and two groups of four, in each case. All subjects who completed the first questionnaire
returned to complete the second. Subjects were paid £6 sterling upon completion of the second
questionnaire.
5.2. Materials
The questionnaire required forecasts of the occurrence, or nonoccurrence (over the subsequent 2
months), of 60 newsworthy national and international events (the full list is presented in Fig. 1). Two
examples of these possible events are unemployment in Britain falling below a certain level and the
government being defeated by a vote in the House of Commons. Each question was phrased in the
following manner:
“A British cabinet minister
(a) will ______ (% sure)
(b) will not
resign or be sacked during the next 2 months.”
Instructions directed subjects to first select one of the two exclusive and exhaustive options as a
prediction (i.e., to circle ‘a’ or ‘b’). Next, subjects were required—on a separate sheet of paper—to
give one reason why they had predicted that the event would or would not take place. Finally, subjects
were directed to note—on the line to the right of the question—a value between 50 and 100 to
represent their confidence in their prediction. Instructions informed subjects that a value of 100%
would indicate that they were absolutely sure of their prediction, while 50% would indicate that they
were completely unsure or indifferent between the two choice options (it would make little sense to
give a value less than 50%, as this would imply that they should have chosen the alternative option).
Ayton and Wright [27] reviewed studies that found close similarity between direct probability
estimates and confidence ratings on a scale similar to that used in the present study. Intuitively, it
seemed that the sample of subjects would find it easier to express their uncertainty as confidence on a
50%-to-100% scale rather than as a probability on a 0.5-to-1.0 scale. Therefore, degree of belief was
measured as a percentage.
Fig. 1. Summary of the 60 predictions used in the study. All of the predictions listed relate to whether or not the given event will
or will not occur in the subsequent two months.
The 60 items were maintained in the same order for the subjects in each designated group of four
or five, although the item order was randomised between the ‘groups’ to control for any order or
fatigue effects. The first-round questionnaires contained the same items for all subjects in all three
conditions. The second-round questionnaires were identical to those of the first, save that subjects
were no longer required to give written reasons for their choice of alternative ‘a’ or ‘b’. In this round,
subjects were merely required to select ‘a’ or ‘b’ and to give a percentage value (50 to 100) to indicate
their confidence in their selection. The 60 items were presented in the same order for each subject as
in the first round.
On the second round, subjects in the Statistical and Reasons conditions were additionally supplied
with feedback sheets, on which various information from the subject’s nominal group (from the first
round) was supplied. Statistical feedback comprised a summary of the other three or four group
members’ choice of options (‘a’ or ‘b’), their confidence ratings for those options, and the median and
mean prediction of the group for each event. To calculate the latter value (mean prediction of the group
for each event), the subjects’ ‘half-scale’ responses (50% to 100%) were converted into full-scale
responses from 0% to 100%, with 0% representing absolute confidence that the event ‘would not
happen’ and 100% representing absolute confidence that the event ‘would happen’. In essence, the
utility of converting subjects’ ‘half-scale’ responses to ‘full-scale’ responses is that the former concerns
individual participant confidence in either forecast ‘a’ or forecast ‘b’, while the latter pertains to the
mean group confidence in both ‘a’ and ‘b’. These figures enabled the calculation of mean and median
responses on the basis of the group confidence of each event taking place. The full-scale values were
then reconverted into half-scale responses so that means and medians could be expressed in terms of
50% to 100% group confidence in alternative ‘a’ or ‘b’. (Definitions of the terms ‘mean’ and ‘median’
were given in the introduction to the second-round questionnaire.) Reasons feedback sheets simply
comprised a copy of the reasons given by the other three or four group members for their responses (‘a’
or ‘b’), with no statistical information on individual confidence values or group means or medians.
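The half-scale to full-scale conversion described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that alternative ‘a’ is the ‘would happen’ option and ‘b’ the ‘would not happen’ option; the function names and the example responses are hypothetical and are not taken from the study materials.

```python
from statistics import mean, median

def to_full_scale(choice, confidence):
    # Assumption: 'a' means "would happen", 'b' means "would not happen".
    # 80% confidence in 'b' becomes 20% on the full "would happen" scale.
    return confidence if choice == "a" else 100 - confidence

def to_half_scale(full):
    # Values below 50 favour 'b'; report confidence in the favoured option.
    return ("a", full) if full >= 50 else ("b", 100 - full)

def group_summary(responses):
    # Convert every response to the full scale, summarise, then convert back.
    full = [to_full_scale(choice, conf) for choice, conf in responses]
    return {"mean": to_half_scale(mean(full)), "median": to_half_scale(median(full))}

# Hypothetical first-round responses from four other group members on one item.
others = [("a", 80), ("b", 60), ("a", 70), ("a", 90)]
summary = group_summary(others)
print(summary)
```

Averaging on the full scale lets responses favouring ‘a’ and ‘b’ offset one another, which is not possible if the 50% to 100% values are averaged directly.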
5.3. Design
The single independent variable ‘feedback’ had three levels (Iteration, Statistical, and Reasons),
which differed according to the nature of the feedback provided to subjects on the second round. The
design was of an ‘independent subjects’ nature, with each subject taking part in only one of the three
conditions and receiving just one type of feedback for all of the 60 questions. As noted, the order of
presentation of the 60 items was identical for subjects within each group, across rounds, but was
randomised across the nominal groups to control for order/fatigue effects.
5.4. Procedure
On presentation of the first-round questionnaires, subjects were informed that they had each been
allocated to a nominal group comprising three to four others and that group membership was known only
to the experimenters, such that subjects’ own anonymity was ensured. All subjects completed the first-
round questionnaire individually and with no interaction with the others nominally assigned to their
group. Specifics concerning the questionnaires and their instructions have been noted above.
During the 2 days succeeding the first round of study, the collected results were collated so that
individually tailored feedback sheets could be constructed for each subject. At the beginning of the
second round, 3 days after the first, subjects were given their second round questionnaire, their
completed first-round questionnaire (for reference), plus the feedback sheet reporting results from their
nominal group (for the Statistical and Reasons conditions only). Subjects were instructed to complete the
questionnaire as before but, this time, to refer to their first round predictions plus their feedback sheet
and to consider the opinions and positions of the other members of the nominal group (where this was
salient) before making any decisions. For items for which subjects got no feedback per se (i.e., only the
reiteration of their own first-round predictions), subjects were asked to ‘‘think again’’ about the event
and to ‘‘try to think of any additional information that might affect your forecast’’. In each condition,
subjects were thus given the opportunity to alter their selected option (that an event would or would not
happen) and to alter their numerical assessment of how confident they felt in that alternative. The
experimenters did stress, however, that subjects should merely repeat their first-round assessment on the
second questionnaire if they felt this to be appropriate.
The subjects’ predictions were assessed after 2 months using quality British newspapers (The
Guardian, The Times, The Observer, and The Telegraph) to determine daily which of the 60 events
had taken place (in actuality, approximately one third of the events occurred). It should also be noted that
during the 2-day gap between rounds, none of the 60 forecast events occurred.
The written information in the first-round questionnaire was not used in any formal analysis but was
elicited from subjects in all conditions (not only in the Reasons condition) to ensure comparability
between first rounds in each of the three conditions.
6. Results
For each subject for each item, the number of group members who agreed with the subject’s
categorical prediction (that an event either would or would not happen) was calculated, yielding a value
between zero and four (i.e., from no one agreeing with the prediction to agreement by all of the other
four group members). These values were averaged for each subject and then across all subjects in each
condition, to give a mean value of homogeneity of subject opinion. Furthermore, because this calculation
is meaningless in the case of the Iteration condition, where subject homogeneity/heterogeneity could
have no immediate impact on judgment change over rounds, these calculations were limited to the
Reasons and Statistical conditions. Results of a two-tailed independent t test between the two means
(2.66, S.D. = 0.13 and 2.67, S.D. = 0.10, for Reasons and Statistical, respectively) failed to identify any
significant difference between the average degree of homogeneity of subjects in the two conditions
(t = 0.26, P>0.05, df = 34). From this result, we infer that the random allocation of subjects to conditions
proved effective in controlling for the different opinions of subjects and that any effect observed between
the feedback conditions may be ascribed to differences in the nature of the feedback formats.
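As an illustration of the homogeneity measure, the following Python sketch counts, for each group member, how many of the other members made the same categorical prediction on an item, and averages that count across items. The data layout (one list of ‘a’/‘b’ choices per item) and the function names are assumptions made for the example.

```python
def support_counts(group_choices):
    # For each member, count the OTHER members who chose the same option.
    return [
        sum(1 for j, other in enumerate(group_choices) if j != i and other == own)
        for i, own in enumerate(group_choices)
    ]

def mean_support(items, member):
    # Average support for one member across items (one list of choices per item).
    counts = [support_counts(choices)[member] for choices in items]
    return sum(counts) / len(counts)

# Hypothetical choices of a five-member nominal group on three items.
items = [list("aabab"), list("bbbbb"), list("abbbb")]
print(support_counts(items[0]))
print(mean_support(items, 0))
```

In a group of five the count ranges from zero to four, matching the range described in the text.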
6.1. Effects of feedback: accuracy
The proportion of correct categorical predictions was calculated for both rounds for each subject.
Table 1 shows the subjects’ performance means and standard deviations on this measure, according to
feedback condition. As can be seen from the table, subjects’ proportion-correct scores were substantially
better than might be expected if subjects were simply providing random answers (i.e., better than 0.5),
with an overall first round mean of 0.667.
Table 1
Mean proportion correct in the three conditions over rounds
Condition                     First round   Second round   t       P
Iteration (n = 28)     Mean   0.667         0.663          −0.34   0.63
                       S.D.   0.089         0.077
Statistical (n = 30)   Mean   0.659         0.681          3.20    0.0016
                       S.D.   0.065         0.051
Reasons (n = 28)       Mean   0.676         0.691          1.48    0.076
                       S.D.   0.065         0.063
Repeated-measure t tests (one-tailed) were performed to determine if there was any improvement in
proportion correct over rounds. These tests indicated a significant improvement in accuracy in the case
of the Statistical condition but not in either the Iteration or Reasons conditions, providing only limited
support for Hypothesis 1A. Indeed, in the case of the Iteration condition, results demonstrated a decrease
in accuracy over rounds.
6.2. Effects of feedback: change
The proportion of items on which subjects changed their predictions to the converse option was
calculated for individual subjects, with mean values determined for subjects in each of the three conditions.
These results are summarised in Table 2. No significant differences emerged between the conditions,
although it is notable that the Iteration condition was associated with the highest mean proportion of
prediction changes, in line with Hypothesis 1B (and the findings of Rowe and Wright [9]). The lack of
significance in the analysis, however, means that the null hypothesis cannot be rejected in this case.
The change measure was additionally correlated to accuracy improvement (i.e., change in proportion-
correct responses per subject over rounds). However, although overall change was significantly
correlated with accuracy improvement in both the Statistical and Iteration conditions (Pearson’s
r=+0.412 and +0.633, respectively, P<0.05 in each case), no relationship appeared to exist between
these variables in the Reasons condition (r=+0.069, P>0.05).
Table 2
Analysis of mean change over rounds between feedback conditions (one-tailed t tests)
Condition     Mean change over rounds       Statistical                     Reasons
Iteration     Mean = 0.113 (S.D. = 0.113)   t = −1.19, P = 0.12, df = 45    t = −1.02, P = 0.16, df = 39
Statistical   Mean = 0.083 (S.D. = 0.072)   –                               t = 0.32, P = 0.62, df = 54
Reasons       Mean = 0.089 (S.D. = 0.056)   –                               –
6.3. Objective expertise
It should be remembered that ‘expertise’ is derived by a post hoc analysis of the number of correct
predictions measured by newspaper reports some 2 months after participants had completed the
experimental questionnaires. In this sense, participants were not experts in any particular discipline or
field of endeavour; a subject’s objective expertise was measured by his or her proportion of questions
answered correctly on the first round. Objective expertise was then correlated with relative propensity to
change judgments over rounds. In all three conditions, high relative expertise was correlated with a low
propensity to change predictions, and this relationship was significant in the Statistical condition
(r = −0.308, P < 0.05, df = 28) and the Iteration condition (r = −0.720, P < 0.01, df = 26), although it
failed to reach significance in the Reasons condition (r = −0.239, P > 0.05, df = 26). Similarly, high first-
round accuracy was negatively correlated to proportion-correct accuracy improvement over rounds,
reaching significance at the P < 0.01 level in each condition (i.e., r = −0.450, −0.634, and −0.533 for
Reasons, Statistical, and Iteration, respectively). It appears that those who were the best predictors on the
first round improved the least but had the least need to improve.
These results provide limited support for Hypothesis 2, in that the expected relationship was observed
for the Statistical condition yet not for the Reasons condition (recall that no prediction was made with
regards the relationship between objective expertise and change in the Iteration condition).
6.4. Confidence and accuracy
The mean confidence of subjects in their first-round predictions was calculated and correlated with
mean first-round accuracy (proportion correct). Against Hypothesis 3A, no evidence was found that
high confidence was related to high initial accuracy (r =+0.059, P>0.05, df = 84), suggesting, by this
summary measure, that subjects’ confidence was not appropriate. To examine the relationship between
accuracy and confidence in greater detail, calibration graphs were produced and are presented in
Figs. 2–4. These figures are based on grouped data and show the relationship between mean
probability response (i.e., confidence given in percent) and mean proportion correct, on the first and
second rounds, for each of the three conditions. Specifically, to plot each point on the graphs, the
grouped confidence data (e.g., from 30 subjects by 60 items) were converted from percentage to
probability values between 0.5 and 1. The resulting values were then divided into six equal parts
(of 300 items in the Statistical condition, and 280 in the other two conditions). The mean of each of
these parts was taken as the x-coordinate, and the mean proportion correct for the corresponding items
was taken as the y-coordinate.
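The construction of the calibration points can be sketched in Python as follows. The sketch assumes that responses are first sorted by confidence and then split into six parts of equal size, which is one plausible reading of the description above; the function name and the small data set are hypothetical.

```python
def calibration_points(confidences, correct, parts=6):
    # confidences: percentages from 50 to 100; correct: 1 if the prediction
    # was right, 0 otherwise. Assumes at least `parts` responses.
    pairs = sorted(
        zip((c / 100 for c in confidences), correct),
        key=lambda pair: pair[0],  # order by confidence only (stable sort)
    )
    size = len(pairs) // parts
    points = []
    for k in range(parts):
        # The last part absorbs any remainder when the split is uneven.
        chunk = pairs[k * size:(k + 1) * size] if k < parts - 1 else pairs[k * size:]
        x = sum(p for p, _ in chunk) / len(chunk)      # mean probability response
        y = sum(hit for _, hit in chunk) / len(chunk)  # proportion correct
        points.append((x, y))
    return points

# Hypothetical pooled responses (12 instead of the 1800 or 1680 in the study).
confidences = [50, 50, 60, 60, 70, 70, 80, 80, 90, 90, 100, 100]
correct = [1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
points = calibration_points(confidences, correct)
print(points)
```

Points lying below the diagonal (y less than x) indicate overconfidence, and points above it indicate underconfidence.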
The figures reveal the nature of subjects’ miscalibration, with all of the curves showing patterns that
are similar to those found generally in studies of probability assessments by groups of individuals (see
Refs. [28,29] for reviews). Specifically, there is slight underconfidence displayed for responses at or
around 0.5 (50%) with increasing overconfidence for higher (confidence) responses. Comparisons
between first- and second-round curves will be considered shortly.
As introduced earlier, a general numerical measure of the appropriateness of confidence is the Brier,
or mean probability, score. The formula for the Brier measure is:
$$PS = \frac{1}{N} \sum (d - f)^2$$
where N = total number of items, d = outcome index (e.g., 0 = will not happen; 1 = will happen), and
f = probability estimate from 0 to 1 (0 = definitely will not happen; 1 = definitely will happen).
Fig. 2. Calibration curves for the two rounds of the Iteration condition.
The Brier PS is a measure of the correspondence between the accuracy of forecasts and confidence in
those forecasts as expressed by subjective probability. Perfect correspondence gives a score of 0 and
counterperfect correspondence a score of 1. Note that, in practice, the score should not exceed 0.25, because
that score can be achieved by someone who responds with a subjective probability of 0.5 (50%) for every
event [29]. The mean Brier score across subjects was 0.228 (S.D. = 0.036), which, although significantly
different from ‘chance’ (two-tailed t = 5.65, P < 0.01), only differed from the chance baseline by a small
amount. This suggests that, although subjects’ confidence ratings may be appropriate to a degree, the
extent of this correspondence may be too small to allow first-round confidence to be used as an effective
indicator of objective expertise/accuracy.
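A minimal Python sketch of the Brier score, following the definition given above, shows why 0.25 acts as the ‘chance’ baseline: a forecaster who answers 0.5 for every event scores exactly 0.25 regardless of the outcomes. The function name and the example values are hypothetical.

```python
def brier_score(outcomes, forecasts):
    # outcomes: d, coded 0 (did not happen) or 1 (happened)
    # forecasts: f, full-scale probabilities from 0 to 1 that the event happens
    n = len(outcomes)
    return sum((d - f) ** 2 for d, f in zip(outcomes, forecasts)) / n

always_half = brier_score([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])  # uninformative forecaster
perfect = brier_score([1, 0], [1.0, 0.0])                      # perfect correspondence
counterperfect = brier_score([1, 0], [0.0, 1.0])               # confidently wrong
print(always_half, perfect, counterperfect)
```

Note that the score is computed on full-scale probabilities, so half-scale responses (50% to 100% confidence in ‘a’ or ‘b’) would need to be converted before scoring.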
6.5. Confidence and propensity to change
Mean first-round confidence was correlated with subjects’ mean proportion of prediction changes
over rounds. Against Hypothesis 3B, high initial mean confidence was not related to low propensity to
change predictions, when correlated within all subjects (r=+0.081, P >0.05, df = 84) or when
correlated within subjects according to feedback condition (r = −0.222, +0.146, and +0.255 for
Reasons, Statistical, and Iteration, respectively, all P >0.05). However, the obtained correlation
between the change measure and Brier scores did reveal a significant overall relationship (r=+0.289,
P< 0.01, df= 84), suggesting, initially, that subjects who gave confidence values that were more
appropriate changed their predictions the least. Taking into account condition, however, reveals that
this relationship is conditional upon the nature of feedback, reaching significance in the Iteration
condition (r = +0.536, P < 0.01, df = 26) but not in either the Statistical (r = +0.196, P > 0.05, df = 28) or
Reasons conditions (r = +0.305, P > 0.05, df = 26).
Fig. 3. Calibration curves for the two rounds of the Statistical condition.
The various analyses above consider the relationship between first-round confidence estimates, the
appropriateness of these, and a variety of measures. It is also of interest to see how the pattern of
confidence appropriateness changes over rounds and how this is associated with the iteration of
estimates and the presence of feedback from others. Figs. 2–4, which report calibration curves based on
grouped data for the three conditions, indicate that calibration improves over rounds in each case, with
the second-round curves being generally closer to the diagonal ideal than those for the first round. This
improvement is clearest in the Statistical condition, where overconfidence is much reduced at the
second round (although more underconfidence is displayed on the lower part of the curve). Improvement
over rounds is also observable in the other two conditions, where there are slight reductions in both
under- and overconfidence.
Fig. 4. Calibration curves for the two rounds of the Reasons condition.
Table 3
Mean Brier scores in the three conditions over rounds
Condition                     First round   Second round   t      P
Iteration (n = 28)     Mean   0.23          0.23           0.85   0.40
                       S.D.   0.04          0.03
Statistical (n = 30)   Mean   0.23          0.22           5.92   < 0.0001
                       S.D.   0.03          0.03
Reasons (n = 28)       Mean   0.22          0.21           3.63   0.0012
                       S.D.   0.03          0.03
To further consider whether the graphically suggested improvements in confidence appropriateness
are genuine, mean Brier scores were derived for subjects under each condition in each round and were
compared statistically. The results are reported in Table 3. Although mean Brier score differences across
rounds were small in magnitude, they proved significant at the P < 0.01 level in both the Statistical and
Reasons conditions but not in the Iteration condition.
In essence, these results suggest that not only does accuracy tend to improve across rounds in Delphi-
like conditions (cf. Table 1), but subjects’ confidence assessments also become more appropriate. In
terms of using first-round confidence measures to attribute expertise, however, the results generally
mirror those of Rowe and Wright [9] and suggest that this is not feasible.
6.6. Majority–minority influences
The level of agreement or disagreement of the members of the nominal groups was measured for
each incident of cross-round prediction change to the converse alternative. Initially, because group size
may be an important determinant of change, this measure was derived separately for prediction
changes of subjects in the groups of five and four. In the former case, each change incident was
classified according to the number of ‘supporters’ of each subject’s initial categorical prediction,
classified as zero, one, two, three, or four; while in the latter case, the number of supporters was
limited to either zero, one, two, or three. The observed frequency of prediction change under each
category of ‘support’ was compared to the expected frequency of change under each category, taking
into account the total number of cases in each support category (across all subjects and items and
irrespective of whether change did or did not occur on the second round). The nature of this
relationship is clarified in Table 4 (groups of five) and Table 5 (groups of four), which tabulate the
proportions of changes/no changes under each degree of support. The information in Table 4 concerns
change versus degree of support aggregated across all groups of five and also across the groups of five
decomposed according to feedback condition; while Table 5 presents information regarding the
Reasons condition only, inasmuch as the two groups of four (ignoring the Iteration condition) both
occurred in this condition.
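The comparison of observed and expected frequencies described above corresponds to a standard chi-square test of independence on a ‘change’ by ‘degree of support’ contingency table. The Python sketch below computes the statistic from raw counts; because the paper reports proportions rather than cell counts, the counts used here are purely hypothetical, and the function name is an assumption.

```python
def chi_square(table):
    # table: rows = change (yes / no), columns = degree of support; cells = counts.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if change were independent of degree of support.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df

# Hypothetical 2 x 2 counts: changes are concentrated where support is low.
stat, df = chi_square([[20, 10], [10, 20]])
print(stat, df)
```

With two rows and five support categories (zero to four) the test has 4 degrees of freedom, and with four categories it has 3, in agreement with the df values reported for the groups of five and four.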
Chi-square analysis between the variables ‘change’ (yes/no) and ‘degree of support’ (0/1/2/3 and,
when appropriate, 4) gave highly significant results for the groups of five (χ² = 681.23, df = 4,
P < 0.01) and the groups of four (χ² = 72.36, df = 3, P < 0.01). Specifically, and in line with
Hypothesis 4A, the greater the opposition of group members to subjects’ initial first round
predictions is (i.e., the lesser the support), the greater their propensity to change judgments on the
second round becomes. This finding held for the five-member groups under both feedback conditions
(χ² = 252.10 and 438.84, for Reasons and Statistical, respectively, both df = 4, P < 0.01). Analysis of
the relationship between the feedback conditions (in terms of prediction change) and degree of
support suggests that the variables ‘condition’ and ‘support’ are independent (χ² = 6.97, df = 4,
P > 0.05) and that the feedback types hence did not differ significantly in the way in which they
effected or mediated the impact of majorities in the nominal groups. There also does not appear to
be any substantial difference in the pattern of results between data from the groups of five and four
(see Tables 4 and 5).
Table 4
Proportion of changes/no changes under different first-round degrees of support (groups of five)
                       Degree of support (groups of 5)
Condition     Change   0       1       2       3       4
All           Yes      0.369   0.413   0.119   0.067   0.032
              No       0.035   0.112   0.214   0.272   0.366
Reasons       Yes      0.333   0.441   0.157   0.029   0.039
              No       0.040   0.108   0.209   0.281   0.361
Statistical   Yes      0.393   0.393   0.093   0.093   0.027
              No       0.032   0.115   0.217   0.266   0.370
The above analysis shows that majorities exert strong influences on minorities in terms of evoking
categorical prediction changes over rounds. However, the analysis reveals only a general relationship
and does not account for the fact that individuals in minorities do not always change to the consensus
position nor does it account for the fact that change sometimes occurs against the majority viewpoint
(e.g., see Table 4). The next analysis considers a possible explanation in terms of objective first-round
accuracy—an explanation that is implicit in the Theory of Errors. More specifically, the next analysis
focuses upon those test items on round one where the majority of subjects in a nominal group (i.e.,
three or more) were correct, compared to where the majority were incorrect (the ratio of which was
approximately 3:1). Recall that in the Reasons condition, there were four groups of five and two
groups of four, allowing the possibility of draws in the latter case. Such draws were not included in
the analysis, necessitating the removal of 8 and 15 items out of 60 (and hence of 32 and 60
individual predictions) in the two groups of four, respectively. Table 6 shows the proportion correct
under each feedback condition, under each round, according to the majority accuracy factor, and
summarises the analysis of the changes in accuracy (using one-tailed related t test) in each cell of the
two-by-two matrix.
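The classification of items by first-round majority accuracy, including the exclusion of draws in groups of four, can be sketched in Python as follows. The function name, the labels, and the example choices are hypothetical.

```python
def majority_status(choices, correct_option):
    # Classify one item for one nominal group by its first-round majority.
    right = sum(1 for c in choices if c == correct_option)
    wrong = len(choices) - right
    if right == wrong:
        return "draw"  # only possible in even-sized groups; excluded from analysis
    return "right" if right > wrong else "wrong"

# Hypothetical first-round choices, where 'a' turned out to be correct.
print(majority_status(list("aabab"), "a"))  # group of five, three correct
print(majority_status(list("aabb"), "a"))   # group of four, split two against two
print(majority_status(list("abbb"), "a"))   # group of four, one correct
```

Items classified as draws would be dropped before computing the proportions correct in each cell of the two-by-two matrix.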
Table 5
Proportion of changes/no changes under different first-round degrees of support (groups of four)
                     Degree of support (groups of 4)
Condition   Change   0       1       2       3
Reasons     Yes      0.426   0.383   0.128   0.064
            No       0.081   0.176   0.363   0.381
Table 6
Proportion correct in the two feedback conditions over rounds, shown as a function of whether the majority of the nominal
groups were correct or incorrect at the first round
                                        Majority right (first round)                 Majority wrong (first round)
                                        First round   Second round   t      P        First round   Second round   t       P
Statistical proportion correct   Mean   0.82          0.87           4.46   0.0001   0.26          0.22           −3.10   0.0043
                                 S.D.   0.06          0.06                           0.11          0.13
Reasons proportion correct       Mean   0.84          0.88           1.86   0.074    0.25          0.20           −1.59   0.12
                                 S.D.   0.08          0.07                           0.12          0.12
As can be seen from Table 6, for items where the majority were right at the first round, subsequent
accuracy improved on the second round, a trend that was found to be significant in the Statistical
condition but not in the Reasons condition. By contrast, for items where the majority were wrong at the
first round, subsequent accuracy declined on the second round—and again, this relationship proved
significant for the Statistical condition but not for the Reasons condition. Analysis using two-way
ANOVAs of the differences in the degree of change of proportion-correct accuracy over rounds between
the feedback conditions did not, however, reveal any significant differences, either in the case of
majority right items [F(1,56) = 0.50, P=0.481] or majority wrong items [F(1,56) = 0.14, P=0.713]. In
terms of the hypotheses, these results do not support Hypothesis 4B, in that the influence of the majority
appears to dominate in both conditions, even to the extent of drawing initially accurate individuals
towards the majority’s erroneous position.
7. Discussion
This study has aimed to reproduce and extend the findings of Rowe and Wright [9] and, in doing so,
has attempted to maintain task characteristics across studies (as much as possible) to increase the chance
of generalising results between them. In terms of reproducing the earlier findings, however, the success
of the present study is limited and lends further weight to the general thesis of Rowe and Wright [9]—
that subtle changes in technique composition, group membership, and other situational factors are liable
to have substantial effects on how and whether Delphi will aid judgmental accuracy. Indeed, the present
study appears to demonstrate just how sensitive results are to any manipulation of task environment. For
example, in spite of the fact that both the present study and the previous one involved short-term
forecasting tasks using groups of five (student) subjects who made predictions about newsworthy
political, economic, and international events, the Iteration condition in the present study led to no
improvement in forecasting across rounds when previously it did. Although the two feedback conditions
in the present study are undoubtedly somewhat dissimilar to those conditions in the earlier study, the
same cannot be said with respect to the Iteration condition, in which precisely the same instructions and
subject requirements were used.
The most probable explanation for the above outcome is that there were subtle differences between
the two tasks and sets of performance measures that were not controlled for across studies. One
difference that might be of importance concerns the nature of elicited subject responses: the previous
study required quantitative responses (e.g., the number of seats to be won by a named political party in a
forthcoming election), while the present study required qualitative ones (i.e., a choice of one answer
from two possibilities). For example, it may be that subjects are more amenable to change and
improvement when they are able to alter an estimate that they might recognise, on reconsideration, to be
exaggerated but may feel a certain defensiveness to admitting a definite, categorical, and undisputed
error—something that would be implied in making a prediction change to the converse of two exclusive
and exhaustive options. Another difference might lie in the relative difficulty of the items being forecast
in the two studies. The essential nature of the cross-study task differences here would seem a topic
worthy of further investigation.
Returning to the results of the present study, evidence was found for an improvement in cross-round
accuracy in the two feedback conditions—a trend that was significant in the Statistical case. With respect
to the propensity of subjects to change predictions over rounds, there were no significant differences
between the three experimental conditions, although (as in the previous study) the Iteration condition led
to a higher proportion of mean changes than either of the feedback conditions. To the extent that results
from the two studies are generalisable, this trend indicates that feedback may actually serve as an
inhibitor of change, perhaps by inducing a defensiveness in panellists. Our feedback formats were
deliberately simplified so as not to confound the effects of Statistical and Reasons feedback, but the
generally prescribed Delphi method uses both types, and it would be interesting to assess in a future
study how panellists responded to both types in one process (e.g., whether one feedback type proved
more influential than another).
In terms of the appropriateness of changes over rounds, however, the present results bear little
similarity to those of Rowe and Wright [9]. In the previous study, an association was found between
increasing propensity of subjects to change judgments and increasing accuracy improvement, in both
feedback conditions (but not in the Iteration condition), contrasting with the present results, where a
similar association was revealed in the Statistical and Iteration conditions but not in the Reasons
condition. Rowe and Wright proposed that feedback allows good forecasters to identify themselves,
while providing information to direct the changes of the less expert—an explanation that might still
account for the outcomes of the Statistical condition but does not explain the lack of influence of the
Reasons feedback or the positive association found in the Iteration condition between these measures.
Explanation of these discrepancies appears to require recourse once more to the task characteristics of
the present study. The general ineffectuality of the Reasons feedback across a number of performance
measures may derive from the sheer number of reasons that our subjects were required to generate.
Computation of the Brier measure requires a large number of probability judgments from subjects. This
may have led to a degree of overload on the subjects, with a consequent decrease in the quality of
arguments and their subsequent ability to appropriately influence opinion change in other panellists.
Indeed, it is worth noting that posttask examination of subjects’ written rationales revealed a large
number of reasons that simply reported opinions (‘I think that this is more likely. . .’) rather than causal
arguments (‘I think that this is more likely because. . .’). Information of the former type is arguably less
rich than that of the latter type and arguably less useful for subjects in terms of discriminating relative
expertise.
With respect to the role of ‘objective expertise’ in nominal groups, success at replicating the results of
Rowe and Wright’s study was once more variable. As in the previous study, it was found that high
relative expertise was associated with a low propensity to change predictions over rounds—an
association that was significant in the Statistical condition and a nonsignificant trend in the Reasons
condition. A similar trend was also found in the Iteration condition, although the general effectiveness of
the Iteration approach proved inferior here to that of the feedback approaches in terms of overall
improvement in aggregate accuracy across rounds.
In Rowe and Wright’s study, mean first-round confidence, as obtained through rating scales, did not
appear to be an especially good predictor of objective expertise. In the present study, an attempt was
made to replicate and extend consideration of the role of confidence in nominal groups, particularly by
using more fine-grained measures of the appropriateness of confidence (i.e., Brier scores). As in the
previous study, little evidence was found of any relationship between subjects’ mean first-round
confidence and either their first-round accuracy or their propensity to change predictions over rounds.
This result argues against the use of confidence measures as discriminants of expertise (for panellist
selection). Consideration of calibration graphs revealed the nature of the miscalibration, with subjects
generally exhibiting overconfidence—a bias that has been frequently reported (e.g., [28,29]).
Providing feedback (either Reasons or Statistical) encouraged more appropriate ratings (i.e., better
Brier scores).
The final factor that was considered in the present study was the influence of majorities and
minorities. Unsurprisingly, it was found that the propensity of subjects to change their predictions over
rounds was significantly related to the degree of support or opposition (in the nominal group) for the
subjects’ initial predictions. The expectation, however, that accurate minorities (excluding overt social
group pressure) could pull inaccurate majorities towards the correct position was not realised; majorities,
whether accurate or otherwise, exerted a significant pull on minorities to the consensual position, even
when that position was fallacious.
The pervasive effect of majorities is, however, likely to be influenced by the degree of expertise
possessed by subjects; although subjects showed a fair degree of competence in this particular
forecasting task (with a hit rate of approximately two-thirds correct), the study of the behaviour of
more-expert subjects might conceivably yield different results, with subjects of a higher base level of
expertise perhaps being more able to resist majority influences in appropriate situations. This is
particularly important, inasmuch as it is not Delphi’s intent to force consensus. Also of interest with
regard to majority/minority influence are the effects of different sized groups and hence the potentially
different magnitudes of opposition and support. The impact of relative expertise and group size on
opinion change and judgment accuracy are areas that ought to be considered in future studies.
For practitioners, the implications of the results of this study are several-fold. First, evidence suggests
that confidence is not a good predictor of expertise and hence should not be used as a selection device
(e.g., to choose among a list of experts). Second and perhaps obviously, practitioners should take care in
choosing their experts, inasmuch as evidence suggests that the ‘better’ the expert is, the more
appropriately they are likely to respond to feedback (see also Ref. [13]). Third, they should be aware
that the power of the majority is not totally undermined in Delphi, and hence, convergence of opinion
over Delphi rounds will not necessarily imply improved forecasting accuracy in every case. And fourth,
the water-muddying results of this study suggest that the practitioner should think carefully about the
nature of feedback they provide, what information it might contain, and how their panellists might react
to it, inasmuch as a variety of studies (including this one) give contrary results as to whether iteration
alone, the feedback of statistical averages, or the use of panellist arguments will lead to the most improved
performance over rounds. In future studies, we hope to disentangle the complex interactions between
feedback type, expertise, panellist personality, and accuracy measures.
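The point that convergence need not imply accuracy can be made concrete with a small worked illustration. The sketch below is not the procedure or data of this study: the five-member panel, the answers, and the update rule in next_round (one minority panellist simply adopts the previous round's majority answer) are all hypothetical assumptions, chosen only to show agreement rising while the proportion correct falls.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most frequently chosen answer in the panel."""
    return Counter(answers).most_common(1)[0][0]


def agreement(answers):
    """Proportion of panellists holding the most frequent answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)


def proportion_correct(answers, truth):
    """Proportion of panellists whose answer matches the true outcome."""
    return sum(a == truth for a in answers) / len(answers)


def next_round(answers, switchers):
    """Hypothetical update rule (an assumption, not the study's model):
    panellists at the given indices adopt the previous majority answer."""
    modal = majority_answer(answers)
    return [modal if i in switchers else a for i, a in enumerate(answers)]


# Hypothetical panel: the majority (3 of 5) favours the incorrect answer "B".
truth = "A"
round1 = ["B", "B", "B", "A", "A"]

# One accurate minority member (index 3) is pulled to the majority position.
round2 = next_round(round1, switchers={3})

for label, answers in (("Round 1", round1), ("Round 2", round2)):
    print(label, answers,
          "agreement =", agreement(answers),
          "proportion correct =", proportion_correct(answers, truth))
```

In this constructed case, agreement rises from three to four of five panellists while the number of correct panellists falls from two to one, which is the pattern a practitioner would miss by monitoring convergence alone.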
References
[1] N. Dalkey, O. Helmer, An experimental application of the Delphi method to the use of experts, Manage. Sci. 9 (1963)
458–474.
[2] H.A. Linstone, M. Turoff, The Delphi Method: Techniques and Applications, Addison-Wesley, London, 1975.
[3] G. Rowe, G. Wright, Expert opinions in forecasting: role of the Delphi technique, in: J.S. Armstrong (Ed.),
Principles of Forecasting: A Handbook for Researchers and Practitioners, Kluwer Academic Publishing, Norwell, MA,
2001, pp. 125–144.
[4] G. Rowe, G. Wright, The Delphi technique as a forecasting tool: issues and analysis, Int. J. Forecast. 15 (1999) 353–375.
[5] G. Rowe, G. Wright, F. Bolger, The Delphi technique: a reevaluation of research and theory, Technol. Forecast. Soc.
Change 39 (3) (1991) 235–251.
[6] M. Scheibe, M. Skutsch, J. Schofer, Experiments in Delphi methodology, in: H.A. Linstone, M. Turoff (Eds.), The Delphi
Method: Techniques and Applications, Addison-Wesley, Reading, MA, 1975, pp. 262–287.
[7] F.J. Parente, J.K. Anderson, P. Myers, T. O’Brien, An examination of factors contributing to Delphi accuracy, J. Forecast.
3 (2) (1984) 173–182.
[8] D.M. Boje, J.K. Murnighan, Group confidence pressures in iterative decisions, Manage. Sci. 28 (10) (1982) 1187–1196.
[9] G. Rowe, G. Wright, The impact of task characteristics on the performance of structured group forecasting techniques, Int.
J. Forecast. 12 (1996) 73–90.
[10] R.J. Best, An experiment in Delphi estimation in marketing decision-making, J. Mark. Res. 11 (1974) 448–452.
[11] F.J. Parente, J.K. Anderson-Parente, Delphi inquiry systems, in: G. Wright, P. Ayton (Eds.), Judgmental Forecasting,
Wiley, Chichester, 1987, pp. 129–156.
[12] B.L. Bonner, M.R. Baumann, R.S. Dalal, The effects of member expertise on group decision-making and performance,
Org. Behav. Hum. Decis. Process. 88 (2) (2002) 719–736.
[13] M.J. Prietula, P.J. Feltovich, F. Marchak, Factors influencing analysis of complex cognitive tasks: a framework and
example from industrial process control, Hum. Factors 42 (1) (2000) 56–74.
[14] M. Sherif, A study of some social factors in perception, Arch. Psychol. 27 (187) (1935) 1–60.
[15] S. Asch, Effects of group pressure upon the modification and distortion of judgments, in: H. Guetzkow (Ed.), Groups,
Leadership and Men, Carnegie Press, Pittsburgh, PA, 1951.
[16] M. Doms, E. van Avermaet, Majority influence, minority influence and conversion effect: a replication, J. Exp. Soc.
Psychol. 16 (1980) 283–292.
[17] S. Moscovici, E. Lage, M. Naffrechoux, Influence of consistent minority on the responses of a majority in a color
perception task, Sociometry 32 (1969) 365–380.
[18] A. Maass, R.D. Clark, Hidden impact of minorities: fifteen years of minority influence research, Psychol. Bull. 95 (1984)
428–450.
[19] S. Keeney, F. Hasson, H.P. McKenna, A critical review of the Delphi technique as a research methodology for nursing, Int.
J. Nurs. Stud. 38 (2001) 195–200.
[20] J.A. Sniezek, An examination of group process in judgemental forecasting, Int. J. Forecast. 5 (1989) 171–178.
[21] J.A. Sniezek, Groups under uncertainty: an examination of confidence in group decision making, Org. Behav. Hum.
Decis. Process. 52 (1) (1992) 124–155.
[22] J.S. Armstrong, Long Range Forecasting: From Crystal Ball to Computer, 2nd ed., Wiley, New York, 1985.
[23] A.H. Murphy, A new vector partition of the probability score, J. Appl. Meteorol. 12 (1973) 595–600.
[24] S. Lichtenstein, B. Fischhoff, L.D. Phillips, Calibration of probabilities: the state of the art to 1980, in: D. Kahneman, P.
Slovic, A. Tversky (Eds.), Judgement Under Uncertainty: Heuristics and Biases, Cambridge Univ. Press, Cambridge,
1982, pp. 306–334.
[25] G.W. Brier, Verification of forecasts expressed in terms of probability, Mon. Weather Rev. 78 (1950) 1–3.
[26] J.F. Yates, Judgment and Decision Making, Prentice-Hall, Englewood Cliffs, NJ, 1990.
[27] P. Ayton, G. Wright, Assessing and improving judgmental probability forecasts, Omega 15 (1987) 191–196.
[28] A.G.R. McClelland, F. Bolger, The calibration of subjective probabilities: theories and models 1980–1994, in: G. Wright,
P. Ayton (Eds.), Subjective Probability, Wiley, Chichester, 1994, pp. 453–481.
[29] J.F. Yates, Subjective probability accuracy analysis, in: G. Wright, P. Ayton (Eds.), Subjective Probability, Wiley,
Chichester, 1994.
Gene Rowe is currently a senior scientist in the Consumer Science Group at the Institute of Food Research, Norwich (UK). His
PhD, gained from the Bristol Business School at the University of the West of England (UWE), concerned the use of nominal
groups to improve human judgment and decision making. Apart from a continuing interest in judgment and decision making,
his research activities and publications have also spanned topics from expert systems and forecasting to risk perception and
public participation. Much of his recent work has focused on the issue of evaluating the effectiveness of public participation
exercises.
George Wright is a professor and the Head of the Department of Management at the Durham Business School. He is an
associate editor of both the International Journal of Forecasting and the Journal of Forecasting. His work has been published
in journals such as Management Science and Strategic Management Journal. His interest is in the role and validity of judgment
in forecasting.
Andy McColl is a doctoral student at the Durham Business School. His research interest is in decision framing and the
behavioral and cognitive processes underlying strategic decision making.