Judgment change during Delphi-like procedures: The role of majority influence, expertise, and confidence

Technological Forecasting & Social Change 72 (2005) 377–399


Gene Rowe a,*, George Wright b, Andy McColl c

a Institute of Food Research, Norwich Research Park, Norwich NR4 7UA, UK

b Strathclyde Graduate Business School, University of Strathclyde, 199 Cathedral Street, Glasgow G4 0QU, UK

c Durham Business School, University of Durham, Mill Hill Lane, Durham DH1 3LB, UK

Received 30 January 2004; received in revised form 24 March 2004; accepted 29 March 2004

Abstract

This study investigates individual opinion change and judgmental accuracy in Delphi-like groups. Results

reveal that the accuracy of judgmental probability forecasts increases over Delphi rounds (in terms of proportion

correct and appropriateness of confidence) when statistical summaries or written rationales are provided from other

members of an individual’s nominal group, but does not increase in a control iteration condition (without

feedback). Additionally, subjects who gave more appropriate probability forecasts on the first round exhibited least

opinion change, although measures of confidence were unrelated to opinion change. Results also show that

majority opinion exerts strong opinion pull on minority opinion even when the majority favours an incorrect

answer (irrespective of the nature of feedback provided). The implications of these results for the utility and

conduct of the Delphi technique are discussed, in particular, with respect to selecting panellists and choosing an

appropriate feedback format.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Delphi; Forecasting; Opinion change; Majority influence

1. Introduction: the Delphi technique

The Delphi technique is a forecasting tool that was developed to allow the benefits of canvassing

multiple judges without the often-corresponding deficits associated with group interaction that may arise

from social processes [1–3]. It is a structured group process, in which individuals are required to give

numerical judgments or forecasts over a number of rounds, with feedback being provided from the

anonymous other members of the panel, and the final aggregate being taken as the process output. It is

not, however, a method intended to force consensus per se: response stability rather than consensus is

the signal to cease additional polling, with disagreement (as indicated by, for example, a bipolar

distribution of responses) accepted as informative.

doi:10.1016/j.techfore.2004.03.004

* Corresponding author. Tel.: +44-1603-255-125.

Delphi’s effectiveness over comparative procedures, at least in terms of judgmental accuracy, has

generally been demonstrated [3]. In a review of empirical studies of Delphi, Rowe and Wright [4] found

that Delphi groups outperformed ‘statistical’ groups (which involve the aggregation of the judgments of

noninteracting individuals) in 12 studies, underperformed these in two, and ‘tied’ in two others, while

Delphi outperformed standard interacting groups in five studies, underperformed in one, and ‘tied’ in

two. This trend is all the more impressive given that many laboratory studies of Delphi effectiveness

have used simplified versions of the technique (e.g., with limited feedback) in simplified contexts (e.g.,

using nonexpert, student subjects) that might be anticipated to undermine the virtues of the technique

[4,5]. We return to this issue shortly.

Although research suggests that Delphi allows improved judgment compared to alternative methods,

as demonstrated in these ‘technique comparison’ studies, the reasons for this are still unclear, given a

relative dearth of ‘process’ studies that have attempted to establish the precise mechanism for

improvement in Delphi [5]. In this study, we attempt to advance understanding of how Delphi improves

judgmental performance.

2. Research into the Delphi process: variables and measures

Generally, it is assumed that Delphi ‘works’ (improves judgmental accuracy) because of the feedback

provided between rounds in conjunction with its anonymity; that is, it is assumed that the lack of cues as

to the identity of the sources of the feedback allows participants to focus upon the content of that

feedback rather than being distracted by extraneous social information, so being influenced by ‘good’

information and uninfluenced by apparently ‘poor’ information. Indeed, the power of feedback was

demonstrated by Scheibe et al. [6], who gave ‘false’ feedback to panellists and found that their estimates

still converged towards this fallacious figure.

However, the extent of influence of feedback is uncertain, in the sense that the variable is confounded

in Delphi studies by the iteration variable. After all, it is possible that Delphi panellists may ignore

feedback and still improve, by using the opportunity afforded by each new round to review their own

previous estimate. Indeed, Parente et al. [7] attempted to separate out the effects of iteration and

feedback and found that, although neither iterated polling nor consensus feedback had a discernible

effect upon ‘if’ accuracy (forecasting if an event would occur), a condition involving iteration alone

resulted in improved accuracy for ‘when’ a newsworthy event would occur while a purely feedback

condition (no iteration involved) did not. Boje and Murnighan [8] also found that accuracy decreased

over rounds in a standard Delphi procedure yet improved in a purely iterative condition. These results

seem to suggest that it is iteration that is the more powerful influence leading to improved accuracy, and

not feedback.

Rowe and Wright [9] have argued, however, that the feedback used in previous studies has been

somewhat superficial. For example, the feedback in Parente et al. [7] simply comprised modes and

medians, while that from Boje and Murnighan [8] comprised estimates and their justifications, without


average figures. They argued that better quality, more informative feedback is liable to be more

(positively) influential. Indeed, Best [10] found that for one of two task items, a Delphi group that was

given ‘Reasons’ feedback in addition to a median and range of estimates was more accurate than a

Delphi group that was provided with feedback that excluded reasons. Rowe and Wright [9] themselves

compared three feedback conditions: an ‘Iteration’ condition (involving iteration over rounds without

feedback from the members of the Delphi panel), a ‘Statistical’ feedback condition (involving median

values and range of estimates), and a ‘Reasons’ feedback condition (involving reasons from the Delphi

panellists along with their numerical estimates). They found that, although subjects were less inclined to

change their forecasts as a result of receiving Reasons feedback than other types, when they did change

forecasts, this change tended to be for the better, leading to a reduction in error. Although subjects tended

to make greater changes to their forecasts in the Iteration and Statistical conditions than in the Reasons

condition, these changes did not, in general, improve predictions.

From these studies, it may be suggested that Delphi ‘works’ partially because of the influence of

iteration, allowing panellists to reflect on previous answers, and partially because of the influence of

feedback, which is most likely to be successful when more rather than less informative. The output from

a Delphi process, however, is an aggregate of the individual judgments of the panellists, and this research

says little about who among the panellists is affected by what aspects of the technique and in what way.

Understanding the mechanics of judgment change in Delphi is likely to have implications, not only for

the best structure of Delphi (e.g., best way to provide feedback), but also for the selection of panellists.

One explanation for how Delphi works that considers the nature of panellists is the ‘Theory of Errors’.

Parente and Anderson-Parente [11] interpret this as follows: Delphi accuracy improves over rounds,

because the panellists comprise essentially two types of people: the ‘holdouts’, who do not change their

opinion (much) over rounds, and the ‘swingers’, who do. If the holdouts are the more accurate (expert)

panellists, then it can be mathematically shown that if the swingers alter their judgments towards the

average value, the group judgment will subsequently improve. Certainly, research from outside the

Delphi domain has indicated that expertise does have an impact on performance within interacting

groups (e.g., Ref. [12]) and that expertise interacts with feedback provision in group and group decision

support system situations (Prietula et al. [13] showed feedback to be useful for experts solving an

engineering problem but to have little effect for ‘intermediate’ experts and none for ‘novices’).
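The arithmetic behind this interpretation can be illustrated with a minimal sketch. It assumes a single numerical item, hypothetical estimates, and a hypothetical rule in which each swinger moves a fixed fraction (`pull`) of the way towards the first-round group mean while holdouts stay put; none of these values comes from the studies cited.

```python
from statistics import mean


def second_round(estimates, is_swinger, pull=0.5):
    """Move each swinger a fraction `pull` of the way toward the round-one group mean.

    Holdouts keep their first-round estimate unchanged.
    `pull` is an illustrative assumption, not a parameter from the paper.
    """
    group_mean = mean(estimates)
    return [
        e + pull * (group_mean - e) if swings else e
        for e, swings in zip(estimates, is_swinger)
    ]


# Hypothetical panel: true value 100, three accurate holdouts, two inaccurate swingers.
TRUE_VALUE = 100.0
round1 = [98.0, 102.0, 101.0, 140.0, 150.0]
is_swinger = [False, False, False, True, True]

round2 = second_round(round1, is_swinger)

error1 = abs(mean(round1) - TRUE_VALUE)  # error of the round-one group mean
error2 = abs(mean(round2) - TRUE_VALUE)  # error of the round-two group mean
print(f"round 1 group error: {error1:.2f}")
print(f"round 2 group error: {error2:.2f}")
```

Because only the less accurate members move, the group mean drifts towards the holdouts' positions and the aggregate error falls. If the holdouts were instead the less accurate members, the same rule would make the aggregate worse.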

Rowe and Wright [9] produced some empirical support for this theory, finding that people who were,

in general, the best forecasters on the first round of a Delphi, were the least prepared to change judgment

in the face of feedback, while the poorest predictors were the most prepared to change. However, this is

just one piece of evidence for the theory. Does this process take place in all/most Delphis? If not, are

there other explanatory mechanisms for accuracy change that are as good or better? Intuitively, for

example, an equally (or perhaps more) compelling theory is that the holdouts are the most confident,

rather than the most knowledgeable, and as such, Delphi will work when confidence is appropriate but

will not when it is inappropriate (and might be as related to a personality trait as to knowledgeability). A

much greater understanding of the mechanism of judgment change seems warranted.

3. The present study

The experiment reported here attempts to build upon earlier findings through replication (by utilising

three conditions similar to the Rowe and Wright study [9]) and by further exploring the mechanisms






behind the influence of feedback. With respect to the latter aim, it is possible that panellists receiving

Reasons feedback were more discriminative than those receiving Statistical feedback. One possible

explanation for the relative lack of appropriate change in the Statistical condition may be that panellists

were simply following majority opinion (see below), inasmuch as they lacked the qualitative information

in the Reasons feedback to enable them to reject the feedback when it was ill-informed.

One issue alluded to, but not explicitly considered, in Rowe and Wright’s study was the impact of

degree of support or opposition on panellists’ individual judgments in Delphi-like groups. It would

seem likely that this factor is an important determinant of the propensity of an individual to change

his/her judgments over rounds and, relatedly, that it may influence the final round accuracy of those

judgments. In interacting groups, the compelling influence of majorities on individuals has been much

studied (e.g., Refs. [14–16]) as has, more recently, the significant influence that minorities can

occasionally exert on majorities (e.g., Refs. [17,18]). In the case of nominal group techniques, such as

Delphi, however, a key aim is to reduce the conformity pressures often associated with majority

influence (such as via panellist anonymity) that might otherwise compel a consensus (although

whether Delphi-like techniques have succeeded in this aim is a matter of debate, e.g., Refs. [4,19]).

One possibility, which might partially account for the results of Rowe and Wright [9], is that different

types of nominal group feedback have differential utility with regard to encouraging panellists to

resist inappropriate majority influences and to accept appropriate minority ones. The present study

considers this issue.

In Rowe and Wright’s study, ‘objective expertise’ (as defined by the actual accuracy of panellists,

determined posttask) was related to the propensity of subjects to change their estimates in the two

Delphi-like conditions. This is to say that ‘expertise’ is derived objectively by post hoc analysis, rather

than by attempting to determine the expertise of participants a priori. Specifically and in line with

Parente and Anderson-Parente’s [11] interpretation of the Theory of Errors, subjects who were more

‘expert’ exhibited less change over rounds than those who were, on average, less ‘expert’ (see also Ref.

[20]). In the present study, the influences of this factor will once more be considered.

Additionally, the present study will analyse the effect of subjects’ confidence in their forecasts on

opinion change and subsequent accuracy. ‘Confidence’ has been a much-studied phenomenon. One

reason is that, in the absence of clear means for identifying the objective accuracy of judgments and

choices prior to a task, the self-evaluations of individuals or groups may prove the only measurable

determinant of effectiveness [20,21]. However, for confidence to be accepted as an appropriate

performance measure, it must be demonstrably related to objective performance quality. Little research

has been done on the appropriateness of confidence of individuals in noninteracting, nominal groups,

although Armstrong [22] and Sniezek [20] have suggested that confidence may be positively

correlated with individual accuracy. Rowe and Wright [9] examined whether initial individual

confidence was a relevant predictor of individuals’ accuracy and propensity to change judgments in

such procedures. Results indicated that average individual confidence exhibited no significant

relationship to individuals’ propensity to change judgments in any of their three conditions (recall

that these were the Iteration, Statistical, and Reasons conditions). At a different level of analysis,

however, results suggested that subject confidence might be broadly appropriate, given that average

second-round confidence was significantly higher than first-round confidence (in all conditions) and

that this corresponded to an increase in second-round accuracy—a result that would appear to justify

further investigation of the relationship between changes in confidence and accuracy within a nominal

group context.






Structured group techniques, such as Delphi, are primarily concerned with inducing change in their

individual members in the hope of achieving greater collective accuracy through the aggregation of

individual estimates. In the current study, subjects were required to make judgments regarding the

occurrence or nonoccurrence of events in the near future and indicate how confident they were in their

forecasts, with objective accuracy subsequently determined. This procedure allows overall accuracy to

be measured in terms of proportion correct, accuracy improvement to be measured in terms of the

increase or decrease in proportion correct, and change to be measured in terms of the proportion of items

for which a different forecast is made on the second round to that made on the first. Hence, our main

performance/response measures in the present study are concerned with (1) judgment change over

rounds (in terms of magnitude not direction) and (2) accuracy improvement over rounds (the

appropriateness of judgment change).
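These three measures can be sketched for a single subject as follows. The example assumes binary forecasts (event occurs or does not occur) over a hypothetical set of eight items; the function names and data are illustrative only and do not come from the study materials.

```python
def proportion_correct(forecasts, outcomes):
    """Share of items on which the forecast matches the eventual outcome."""
    return sum(f == o for f, o in zip(forecasts, outcomes)) / len(outcomes)


def proportion_changed(first_round, second_round):
    """Share of items on which the second-round forecast differs from the first."""
    return sum(a != b for a, b in zip(first_round, second_round)) / len(first_round)


# Hypothetical subject; True means "the event will occur".
outcomes = [True, False, True, True, False, False, True, False]
round1 = [True, True, False, True, False, True, True, False]
round2 = [True, False, True, True, False, True, True, False]

accuracy1 = proportion_correct(round1, outcomes)
accuracy2 = proportion_correct(round2, outcomes)
improvement = accuracy2 - accuracy1          # accuracy improvement over rounds
change = proportion_changed(round1, round2)  # magnitude of judgment change

print(accuracy1, accuracy2, improvement, change)
```

The change measure ignores direction: a subject who switches two forecasts from right to wrong has the same change score as the hypothetical subject here, but a negative improvement.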

Additionally, performance quality is measurable in terms of the appropriateness of subjects’

confidence in their judgments. ‘Calibration’ is one measure often used to consider confidence

appropriateness (e.g., Refs. [23,24]) and is demonstrated in the following example. Consider a subject

who gives a confidence value of 70% (or .7) for 10 different forecasts (out of a larger set of forecasts): if

7 of these 10 forecasts prove correct, then that individual is said to be perfectly calibrated; if fewer than 7

are correct, then she/he is said to be ‘overconfident’; and if more than 7 are correct, then she/he is

deemed to be ‘underconfident’. Likewise, if that subject gave a confidence value of 60% for 10 other

forecasts (from the larger set), then they would be perfectly calibrated if 6 of these were correct (or if 12

of 20 forecasts were correct, etc.), overconfident if fewer than 6 were correct, and underconfident if more

than 6 were correct.
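The worked example can be expressed as a short check. This is a minimal sketch for a single confidence level; the function name and the small numerical tolerance are assumptions made for illustration.

```python
def calibration_status(confidence, n_correct, n_forecasts, tol=1e-9):
    """Compare a stated confidence level with the observed proportion correct.

    `tol` guards against floating-point rounding; it is an implementation
    detail of this sketch, not part of the definition of calibration.
    """
    hit_rate = n_correct / n_forecasts
    if abs(hit_rate - confidence) <= tol:
        return "perfectly calibrated"
    return "overconfident" if hit_rate < confidence else "underconfident"


print(calibration_status(0.7, 7, 10))   # 7 of 10 correct at 70% confidence
print(calibration_status(0.7, 5, 10))   # fewer than 7 correct
print(calibration_status(0.7, 9, 10))   # more than 7 correct
print(calibration_status(0.6, 12, 20))  # 12 of 20 correct at 60% confidence
```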

The concept of calibration is often best expressed and is easiest to understand through the use of

calibration graphs. More precisely, graphs can be drawn to express the relationship between subjects’

mean probability responses and their subsequent forecast accuracy, as measured by mean proportion

correct. Perfect calibration means a perfect correspondence between these measures and is manifest in a

calibration graph by the diagonal through the origin. ‘Overconfidence’ occurs when subjects’ probability

responses are greater than proportion correct for the corresponding events. If this is the case, then the

calibration curve falls below the diagonal when mean probability response is plotted on the x axis and

proportion correct on the y axis. Conversely, underconfidence is indicated by a calibration curve lying

above the diagonal, which indicates that, for a particular level of probability response, subjects correctly

predicted the occurrence of events more frequently than they expected.
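The points of such a graph can be computed as in the sketch below. It assumes that responses are grouped by the exact probability stated (rather than by ranges of probability) and uses hypothetical data; the grouping rule and all names are illustrative assumptions.

```python
from collections import defaultdict


def calibration_curve(probabilities, correct):
    """Return sorted (stated probability, proportion correct, n) points.

    Each point is one x/y pair on the calibration graph: x is the probability
    response, y is the proportion of those forecasts that proved correct.
    """
    groups = defaultdict(list)
    for p, c in zip(probabilities, correct):
        groups[p].append(c)
    return [(p, sum(hits) / len(hits), len(hits)) for p, hits in sorted(groups.items())]


# Hypothetical responses: stated probability and whether the forecast was correct (1/0).
probabilities = [0.6] * 5 + [0.8] * 5 + [1.0] * 4
correct = [1, 1, 1, 1, 0] + [1, 1, 1, 0, 0] + [1, 1, 1, 0]

for p, hit_rate, n in calibration_curve(probabilities, correct):
    if hit_rate < p:
        position = "below the diagonal (overconfident)"
    elif hit_rate > p:
        position = "above the diagonal (underconfident)"
    else:
        position = "on the diagonal (perfectly calibrated)"
    print(f"p={p:.1f}  proportion correct={hit_rate:.2f}  n={n}  {position}")
```

With these data the point at 0.6 lies above the diagonal and the points at 0.8 and 1.0 lie below it, so a single subject can be underconfident at one level and overconfident at another.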

Although a formula exists for the calculation of calibration per se, which indicates the extent of

individual over- or underconfidence, the main interest of the present study is in the global level of

confidence appropriateness and its relationship to other measures, rather than the precise and

potentially complex nature of miscalibration. A more comprehensive measure of appropriateness

of confidence is the mean probability or Brier score [25]. It is perhaps the most common measure of

the external correspondence of confidence beliefs and is generally accepted as being easier to

interpret than the outcome of the calibration formula. Furthermore, the Brier score encompasses

calibration, with the numerical measure of calibration being a subcomponent of the overall Brier

score (see, for example, Ref. [26], for discussion of the differences between, and advantages of, the

two measures).
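The relationship between the two measures can be sketched as follows. The paper's own formulation appears in its Results section; the sketch below instead uses the standard definition of the mean probability score (the mean squared difference between probability response and outcome) and the standard three-term partition into calibration, resolution, and outcome uncertainty. The data are hypothetical, and treating each response as a confidence in the chosen forecast with outcome 1 for correct and 0 for incorrect is an assumption.

```python
from collections import defaultdict


def brier_score(probabilities, outcomes):
    """Mean squared difference between probability responses and 0/1 outcomes."""
    return sum((p - d) ** 2 for p, d in zip(probabilities, outcomes)) / len(outcomes)


def partition(probabilities, outcomes):
    """Return (calibration, resolution, uncertainty) for grouped responses.

    Responses are grouped by identical stated probability, in which case
    brier_score == calibration - resolution + uncertainty.
    """
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, d in zip(probabilities, outcomes):
        groups[p].append(d)
    calibration = sum(
        len(ds) * (p - sum(ds) / len(ds)) ** 2 for p, ds in groups.items()
    ) / n
    resolution = sum(
        len(ds) * (sum(ds) / len(ds) - base_rate) ** 2 for ds in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return calibration, resolution, uncertainty


# Hypothetical responses: confidence in the chosen forecast, and 1 if it proved correct.
probabilities = [0.6] * 5 + [0.8] * 5 + [1.0] * 4
outcomes = [1, 1, 1, 1, 0] + [1, 1, 1, 0, 0] + [1, 1, 1, 0]

score = brier_score(probabilities, outcomes)
calibration, resolution, uncertainty = partition(probabilities, outcomes)
print(f"Brier score: {score:.4f}")
print(f"calibration - resolution + uncertainty: {calibration - resolution + uncertainty:.4f}")
```

Lower scores indicate more appropriate confidence. The calibration term is zero only when every stated probability equals the proportion correct in its group, which is the sense in which calibration is a subcomponent of the overall score.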

Therefore, in the following analysis, the appropriateness of confidence will be assessed in two ways:

graphically, through the use of calibration graphs (to show, visually, how confidence appropriateness

varies across conditions and rounds), and statistically, through analysis of Brier scores. A precise



formulation of the latter measure is given in the Results section. The key issue underlying the

consideration of these additional performance measures is that, if self-reports of confidence are generally

shown to be appropriate, then the case for using confidence as a surrogate for objective expertise is

strengthened.

As in the Rowe and Wright study, the hypotheses below relate to the expected main effects of each of

the external (feedback, majority–minority influence) and internal (objective expertise, confidence)

variables in relation to the two key measures of change and accuracy. The additional performance

measures noted above are discussed in Hypothesis 3 below (i.e., related to the confidence factor), rather

than used to form distinct hypotheses in each of the internal/external factor sections. The policy of

treating the confidence appropriateness measures separately, in a single section, is adopted to add clarity

and structure to the subsequent analysis.

4. Hypotheses

4.1. Effects of feedback

(A) In line with the findings of Rowe and Wright’s study [9], it is predicted that accuracy will improve

from the first to the second round across all three conditions (Iteration, Statistical, and Reasons) in terms

of the accuracy measure (proportion correct). In accordance with the earlier findings, no differences in

the degree of accuracy improvement, between conditions, are explicitly predicted.

(B) It is predicted that the degree of judgment change over rounds (in terms of the selection of

converse predictions) will be highest for the Iteration condition, next highest for the Statistical

condition, and lowest for the Reasons condition; that is, participants who select a prediction on the

first round of Delphi polling are most likely to amend this prediction in the light of ‘feedback’ in the

Iteration condition and are least likely to amend initial predictions in the Reasons condition. This

hypothesis is in correspondence with the findings of Rowe and Wright, in which the, perhaps,

counterintuitive results (given that more profound feedback might be expected to be more influential

in encouraging judgment change, rather than less) were explained in terms of the differing capacity

of feedback types to inform subjects of the expertise of their group peers; that is, as feedback

becomes more specific and less ambiguous, it may provide more information about the quality of the

others in the subject’s group and, in particular, allow identification of poor or specious arguments

that might otherwise be disguised by a simple statistic. Hence, with Reasons feedback, it is expected

that subjects will change their forecasts less, but when they do change, they should do so on more

appropriate occasions (i.e., generally leading to more accurate predictions over rounds, as per

prediction 1A).

An important caveat to the above prediction (as in the previous study) is that the subjects under each

feedback condition should face, on average, roughly the same degree of opinion pull against their initial

positions: if differences exist across conditions due to the samples of subjects being in relatively greater

agreement or disagreement, then this is liable to mask or distort any differential influence of the feedback

that derives simply from ‘type’. Although a procedure of randomly allocating subjects to conditions

should control for such outcomes, the potential importance of this factor on judgment change over

rounds is such that the degree of opinion pull per condition will be addressed analytically in the Results

section.


4.2. Objective expertise

Consistent with the Theory of Errors and Ref. [9], it is hypothesised that the greater the objective

expertise of subjects (reflected in a higher proportion of correct first-round predictions), the less their

judgments will change over rounds (in terms of the selection of converse options). Inasmuch as only

two of the conditions meet the broad requirements of Delphi (involving some form of feedback), this

prediction is made for the Statistical and Reasons conditions alone; no specific hypothesis is made

concerning the Iteration condition.

4.3. Confidence

Sniezek [21] has suggested that confidence may be an appropriate surrogate predictor of objective

expertise, although results from the Rowe and Wright [9] study were equivocal on this issue. By using

the Brier score, however, the appropriateness of confidence can be calculated.

It is thus predicted that (A) high first-round mean confidence will be related to high first-round

accuracy. With respect to judgment change over rounds, it is predicted that (B) high first-round

confidence will be related to low propensity to make forecast changes over rounds (in terms of changing

predictions to the converse alternative).

Calibration graphs and Brier scores will be used to assess individuals’ appropriateness of confidence

and, in particular, how this changes over rounds in correspondence with the different feedback

conditions. Because hypotheses on these issues are neither clear nor motivated by past work, none will

be explicitly stated here, although it might be anticipated that subjects’ confidence appropriateness will

improve over rounds as a result of the provision of feedback and that the respective degree of

improvement will vary across conditions.

4.4. Majority–minority influence

As previously discussed, one of the main reasons for the use of structured groups is the reduction

of the kind of social pressures that are often observed in interacting groups and which may

adversely affect the quality of judgment and performance. Although majorities might still be

expected to exert a degree of pressure for judgment change (over rounds) in structured/nominal

groups, it is to be anticipated that change will be more informed than in interacting groups due to

the lesser immediacy of the anonymous others. Consequently, it is predicted that (A) the more

opponents there are to an individual’s position, the more likely she/he will be to change to the

converse prediction on the second round. However, in anticipation of the more discriminative use of

the feedback information, it is predicted that (B) when the majority of nominal group members are

‘in the right’ (with respect to their first-round prediction), the proportion of correct prediction will

increase on the second round due to the informational and normative influences on the minorities;

but when the majority are ‘in the wrong’, proportion correct will also increase due to the ability of

the minorities to resist majority pressures, and indeed, to pull the incorrect majorities towards their

own positions. This latter hypothesis relates to the Theory of Errors, in that the expectation is that those

individuals who are relatively more expert will ‘hold out’ rather than ‘swing’ and will subsequently pull

the less good towards their position, so enhancing accuracy, even in cases when the majority view is

against them.






It is expected that the above predicted main effect will be moderated by interaction with other

variables, namely, feedback type, objective expertise, and (original) level of confidence, but, as these

interactions are liable to be complex, no predictions are made about these here. It should also be noted

that because the above hypotheses rely on feedback of the opinions of others, they relate to the Statistical

and Reasons feedback conditions alone.

5. Method

5.1. Subjects

Ninety undergraduate students from the University of the West of England (from a variety of

courses) were recruited to complete two questionnaires over the course of 4 days, with each

questionnaire taking approximately 45 min to complete. Subjects were randomly allocated to 18

nominal five-member groups. Thus, six groups were allocated to each of the three feedback conditions.

Four recruited subjects did not arrive for the first questionnaire, such that two of the conditions

(namely, Iteration and Reasons) had only 28 allocated individuals, who were composed into four

groups of five and two groups of four, in each case. All subjects who completed the first questionnaire

returned to complete the second. Subjects were paid £6 sterling upon completion of the second

questionnaire.

5.2. Materials

The questionnaire required forecasts of the occurrence, or nonoccurrence (over the subsequent 2

months), of 60 newsworthy national and international events (the full list is presented in Fig. 1). Two

examples of these possible events are unemployment in Britain falling below a certain level and the

government being defeated by a vote in the House of Commons. Each question was phrased in the

following manner:

‘‘A British cabinet minister

(a) will (% sure)

(b) will not

resign or be sacked during the next 2 months.’’

Instructions directed subjects to first select one of the two exclusive and exhaustive options as a

prediction (i.e., to circle ‘a’ or ‘b’). Next, subjects were required—on a separate sheet of paper—to

give one reason why they had predicted that the event would or would not take place. Finally, subjects

were directed to note—on the line to the right of the question—a value between 50 and 100 to

represent their confidence in their prediction. Instructions informed subjects that a value of 100%

would indicate that they were absolutely sure of their prediction, while 50% would indicate that they

were completely unsure or indifferent between the two choice options (it would make little sense to

give a value less than 50%, as this would imply that they should have chosen the alternative option).

Ayton and Wright [27] reviewed studies that found close similarity between direct probability

estimates and confidence ratings on a scale similar to that used in the present study. Intuitively, it


Fig. 1. Summary of the 60 predictions used in the study. All of the predictions listed relate to whether or not the given event will

or will not occur in the subsequent two months.




seemed that the sample of subjects would find it easier to express their uncertainty as confidence on a

50%-to-100% scale rather than as a probability on a 0.5-to-1.0 scale. Therefore, degree of belief was

measured as a percentage.


The 60 items were maintained in the same order for the subjects in each designated group of four

or five, although the item order was randomised between the ‘groups’ to control for any order or

fatigue effects. The first-round questionnaires contained the same items for all subjects in all three

conditions. The second-round questionnaires were identical to those of the first, save that subjects

were no longer required to give written reasons for their choice of alternative ‘a’ or ‘b’. In this round,

subjects were merely required to select ‘a’ or ‘b’ and to give a percentage value (50 to 100) to indicate

their confidence in their selection. The 60 items were presented in the same order for each subject as

in the first round.

On the second round, subjects in the Statistical and Reasons conditions were additionally supplied

with feedback sheets, on which various information from the subject’s nominal group (from the first

round) was supplied. Statistical feedback comprised a summary of the other three or four group

members’ choice of options (‘a’ or ‘b’), their confidence ratings for those options, and the median and

mean prediction of the group for each event. To calculate the latter value (mean prediction of the group

for each event), the subjects’ ‘half-scale’ responses (50% to 100%) were converted into full-scale

responses from 0% to 100%, with 0% representing absolute confidence that the event ‘would not

happen’ and 100% representing absolute confidence that the event ‘would happen’. In essence, the

utility of converting subjects’ ‘half-scale’ responses to ‘full-scale’ responses is that the former concerns

individual participant confidence in either forecast ‘a’ or forecast ‘b’, while the latter pertains to the

mean group confidence in both ‘a’ and ‘b’. These figures enabled the calculation of mean and median

responses on the basis of the group confidence of each event taking place. The full-scale values were

then reconverted into half-scale responses so that means and medians could be expressed in terms of

50% to 100% group confidence in alternative ‘a’ or ‘b’. (Definitions of the terms ‘mean’ and ‘median’

were given in the introduction to the second-round questionnaire.) Reasons feedback sheets simply

comprised a copy of the reasons given by the other three or four group members for their responses (‘a’

or ‘b’), with no statistical information on individual confidence values or group means or medians.
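The conversion between half-scale and full-scale responses described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to prepare the feedback sheets: the function names are hypothetical, option ‘a’ is assumed to denote ‘will happen’ (as in the example question), and assigning a group value of exactly 50% to option ‘a’ is an arbitrary assumption.

```python
from statistics import mean, median


def to_full_scale(choice, confidence):
    """Map a half-scale response (50-100% sure of 'a' or 'b') to a
    full-scale percentage (0-100%) that the event will happen.
    Assumption: 'a' means 'will happen', 'b' means 'will not happen'."""
    if not 50 <= confidence <= 100:
        raise ValueError("confidence must lie between 50 and 100")
    return confidence if choice == "a" else 100 - confidence


def to_half_scale(full):
    """Map a full-scale percentage back to (choice, confidence).
    Assumption: a value of exactly 50 is reported as option 'a'."""
    return ("a", full) if full >= 50 else ("b", 100 - full)


def group_feedback(responses):
    """Mean and median group prediction, expressed on the half scale."""
    full = [to_full_scale(choice, conf) for choice, conf in responses]
    return {
        "mean": to_half_scale(mean(full)),
        "median": to_half_scale(median(full)),
    }


# Hypothetical first-round responses of four other group members to one item.
others = [("a", 70), ("b", 60), ("a", 90), ("b", 80)]
feedback = group_feedback(others)
print(feedback)
```

In this hypothetical case, the full-scale values are 70, 40, 90, and 20, so both the mean and the median correspond to 55% group confidence in alternative ‘a’.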

5.3. Design

The single independent variable ‘feedback’ had three levels (Iteration, Statistical, and Reasons),

which differed according to the nature of the feedback provided to subjects on the second round. The

design was of an ‘independent subjects’ nature, with each subject taking part in only one of the three

conditions and receiving just one type of feedback for all of the 60 questions. As noted, the order of

presentation of the 60 items was identical for subjects within each group, across rounds, but was

randomised across the nominal groups to control for order/fatigue effects.

5.4. Procedure

On presentation of the first-round questionnaires, subjects were informed that they had each been

allocated to a nominal group comprising three to four others and that group membership was known only

to the experimenters, such that subjects’ own anonymity was ensured. All subjects completed the first-

round questionnaire individually and with no interaction with the others nominally assigned to their

group. Specifics concerning the questionnaires and their instructions have been noted above.

During the 2 days succeeding the first round of study, the collected results were collated so that

individually tailored feedback sheets could be constructed for each subject. At the beginning of the


second round, 3 days after the first, subjects were given their second round questionnaire, their

completed first-round questionnaire (for reference), plus the feedback sheet reporting results from their

nominal group (for the Statistical and Reasons conditions only). Subjects were instructed to complete the

questionnaire as before but, this time, to refer to their first round predictions plus their feedback sheet

and to consider the opinions and positions of the other members of the nominal group (where this was

salient) before making any decisions. For items for which subjects got no feedback per se (i.e., only the

reiteration of their own first-round predictions), subjects were asked to ‘‘think again’’ about the event

and to ‘‘try to think of any additional information that might affect your forecast’’. In each condition,

subjects were thus given the opportunity to alter their selected option (that an event would or would not

happen) and to alter their numerical assessment of how confident they felt in that alternative. The

experimenters did stress, however, that subjects should merely repeat their first-round assessment on the

second questionnaire if they felt this to be appropriate.

The subjects’ predictions were assessed after 2 months using quality British newspapers (The

Guardian, The Times, The Observer, and The Telegraph) to determine daily which of the 60 events

had taken place (in actuality, approximately one third of the events occurred). It should also be noted that

during the 2-day gap between rounds, none of the 60 forecast events occurred.

The written information in the first-round questionnaire was not used in any formal analysis but was

elicited from subjects in all conditions (not only in the Reasons condition) to ensure comparability

between first rounds in each of the three conditions.

6. Results

For each subject for each item, the number of group members who agreed with the subject’s

categorical prediction (that an event either would or would not happen) was calculated, yielding a value

between zero and four (i.e., from no one agreeing with the prediction to agreement by all of the other

four group members). These values were averaged for each subject and then across all subjects in each

condition, to give a mean value of homogeneity of subject opinion. Furthermore, because this calculation

is meaningless in the case of the Iteration condition, where subject homogeneity/heterogeneity could

have no immediate impact on judgment change over rounds, these calculations were limited to the

Reasons and Statistical conditions. Results of a two-tailed independent t test between the two means

(2.66, S.D. = 0.13 and 2.67, S.D. = 0.10, for Reasons and Statistical, respectively) failed to identify any

significant difference between the average degree of homogeneity of subjects in the two conditions

(t = 0.26, P>0.05, df = 34). From this result, we infer that the random allocation of subjects to conditions

proved effective in controlling for the different opinions of subjects and that any effect observed between

the feedback conditions may be ascribed to differences in the nature of the feedback formats.
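The homogeneity measure described above (the number of other group members agreeing with a subject's categorical prediction, averaged over items) can be sketched as follows. The function name and the data are hypothetical and serve only to illustrate the calculation.

```python
def mean_support(group_choices):
    """group_choices maps each subject in one nominal group to a list of
    categorical predictions ('a' or 'b'), one per item.
    Returns, per subject, the mean number of OTHER members who made the
    same prediction (0 to 4 in a five-member group)."""
    subjects = list(group_choices)
    n_items = len(group_choices[subjects[0]])
    result = {}
    for subject in subjects:
        agreements = 0
        for item in range(n_items):
            own = group_choices[subject][item]
            agreements += sum(
                1
                for other in subjects
                if other != subject and group_choices[other][item] == own
            )
        result[subject] = agreements / n_items
    return result


# Hypothetical five-member group answering two items.
group = {
    "s1": ["a", "a"],
    "s2": ["a", "b"],
    "s3": ["a", "b"],
    "s4": ["b", "b"],
    "s5": ["a", "a"],
}
support = mean_support(group)
print(support)
```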

6.1. Effects of feedback: accuracy

The proportion of correct categorical predictions was calculated for both rounds for each subject.

Table 1 shows the subjects’ performance means and standard deviations on this measure, according to

feedback condition. As can be seen from the table, subjects’ proportion-correct scores were substantially

better than might be expected if subjects were simply providing random answers (i.e., better than 0.5),

with an overall first round mean of 0.667.

Table 1

Mean proportion correct in the three conditions over rounds

Condition                     First round   Second round   t       P
Iteration (n = 28)     Mean   0.667         0.663          −0.34   0.63
                       S.D.   0.089         0.077
Statistical (n = 30)   Mean   0.659         0.681          3.20    0.0016
                       S.D.   0.065         0.051
Reasons (n = 28)       Mean   0.676         0.691          1.48    0.076
                       S.D.   0.065         0.063


Repeated-measure t tests (one-tailed) were performed to determine if there was any improvement in

proportion correct over rounds. These tests indicated a significant improvement in accuracy in the case

of the Statistical condition but not in either the Iteration or Reasons conditions, providing only limited

support for Hypothesis 1A. Indeed, in the case of the Iteration condition, results showed a slight, nonsignificant decrease

in accuracy over rounds.
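For readers unfamiliar with the repeated-measures t test, the statistic can be sketched as follows. The data are hypothetical (they are not the subjects' scores), the function name is an assumption, and the P value is not computed here because it requires the t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first_round, second_round):
    """Repeated-measures t statistic for improvement over rounds.
    Positive t means second-round scores are higher on average.
    Note: undefined (division by zero) if all differences are identical."""
    diffs = [second - first for first, second in zip(first_round, second_round)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1  # (t, degrees of freedom)


# Hypothetical proportion-correct scores for four subjects.
first = [0.60, 0.70, 0.65, 0.70]
second = [0.65, 0.70, 0.70, 0.80]
t_value, df = paired_t(first, second)
print(round(t_value, 3), df)
```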

6.2. Effects of feedback: change

The proportion of items on which subjects changed their predictions to the converse option was

calculated for individual subjects, with mean values determined for subjects in each of the three conditions.

These results are summarised in Table 2. No significant differences emerged between the conditions,

although it is notable that the Iteration condition was associated with the highest mean number of

prediction changes, in line with Hypothesis 1B (and the findings of Rowe and Wright [9]). The lack of

significance in the analysis, however, means that the null hypothesis cannot be rejected in this case.

The change measure was additionally correlated to accuracy improvement (i.e., change in proportion-

correct responses per subject over rounds). However, although overall change was significantly

correlated with accuracy improvement in both the Statistical and Iteration conditions (Pearson’s

r=+0.412 and +0.633, respectively, P<0.05 in each case), no relationship appeared to exist between

these variables in the Reasons condition (r=+0.069, P>0.05).

6.3. Objective expertise

It should be remembered that ‘expertise’ is derived by a post hoc analysis of the number of correct

predictions measured by newspaper reports some 2 months after participants had completed the

Table 2

Analysis of mean change over rounds between feedback conditions (one-tailed t tests)

Condition     Mean change over rounds       Statistical                     Reasons
Iteration     Mean = 0.113 (S.D. = 0.113)   t = −1.19, P = 0.12, df = 45    t = −1.02, P = 0.16, df = 39
Statistical   Mean = 0.083 (S.D. = 0.072)   –                               t = 0.32, P = 0.62, df = 54
Reasons       Mean = 0.089 (S.D. = 0.056)   –                               –



experimental questionnaires. In this sense, participants were not experts in any particular discipline or

field of endeavour; a subject’s objective expertise was measured by his or her proportion of questions

answered correctly on the first round. Objective expertise was then correlated with relative propensity to

change judgments over rounds. In all three conditions, high relative expertise was correlated with a low

propensity to change predictions, and this relationship was significant in the Statistical condition

(r = −0.308, P < 0.05, df = 28) and the Iteration condition (r = −0.720, P < 0.01, df = 26), although it

failed to reach significance in the Reasons condition (r = −0.239, P > 0.05, df = 26). Similarly, high first-

round accuracy was negatively correlated with proportion-correct accuracy improvement over rounds,

reaching significance at the P < 0.01 level in each condition (i.e., r = −0.450, −0.634, and −0.533 for

Reasons, Statistical, and Iteration, respectively). It appears that those who were the best predictors on the

first round improved the least but had the least need to improve.
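The relationship examined above can be illustrated with a short sketch that derives each subject's objective expertise (first-round proportion correct) and propensity to change, and then correlates the two. All names and data are hypothetical; the toy data are constructed so that the correlation is exactly −1 and do not reproduce the reported results.

```python
from math import sqrt


def proportion_correct(choices, outcomes):
    """Share of categorical predictions that match the eventual outcomes."""
    return sum(c == o for c, o in zip(choices, outcomes)) / len(outcomes)


def proportion_changed(first, second):
    """Share of items on which the prediction switched to the converse option."""
    return sum(a != b for a, b in zip(first, second)) / len(first)


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical outcomes ('a' = happened) and two rounds of predictions.
outcomes = ["a", "b", "b", "a"]
first = {"s1": ["a", "b", "b", "a"], "s2": ["a", "b", "a", "a"], "s3": ["b", "a", "b", "a"]}
second = {"s1": ["a", "b", "b", "a"], "s2": ["a", "b", "b", "a"], "s3": ["a", "b", "b", "a"]}

expertise = [proportion_correct(first[s], outcomes) for s in first]
change = [proportion_changed(first[s], second[s]) for s in first]
r = pearson_r(expertise, change)
print(expertise, change, r)
```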

These results provide limited support for Hypothesis 2, in that the expected relationship was observed

for the Statistical condition yet not for the Reasons condition (recall that no prediction was made with

regards the relationship between objective expertise and change in the Iteration condition).

6.4. Confidence and accuracy

The mean confidence of subjects in their first-round predictions was calculated and correlated with

mean first-round accuracy (proportion correct). Against Hypothesis 3A no evidence was found that

high confidence was related to high initial accuracy (r =+0.059, P>0.05, df = 84), suggesting, by this

summary measure, that subjects’ confidence was not appropriate. To examine the relationship between

accuracy and confidence in greater detail, calibration graphs were produced and are presented in

Figs. 2–4. These figures are based on grouped data and show the relationship between mean

probability response (i.e., confidence given in percent) and mean proportion correct, on the first and

second rounds, for each of the three conditions. Specifically, to plot each point on the graphs, the

grouped confidence data (e.g., from 30 subjects by 60 items) were converted from percentage to

probability values between 0.5 and 1. The resulting values were then divided into six equal parts

(of 300 items in the Statistical condition, and 280 in the other two conditions). The mean of each of

these parts was taken as the x-coordinate, and the mean proportion correct for the corresponding items

was taken as the y-coordinate.
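The construction of the calibration points described above can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, responses with equal confidence are ordered arbitrarily at the boundaries between parts, and any remainder (when the number of responses is not divisible by six) is placed in the last part. The text does not say how such cases were handled.

```python
def calibration_points(confidences, correct, n_parts=6):
    """confidences: percentages from 50 to 100, one per response.
    correct: 1 if the corresponding prediction was right, else 0.
    Returns one (mean probability, proportion correct) pair per part."""
    # Sort by confidence only, so ties keep their original order.
    pairs = sorted(zip(confidences, correct), key=lambda pair: pair[0])
    size = len(pairs) // n_parts
    points = []
    for part in range(n_parts):
        start = part * size
        end = (part + 1) * size if part < n_parts - 1 else len(pairs)
        chunk = pairs[start:end]
        xs = [conf / 100 for conf, _ in chunk]  # percent -> probability
        ys = [hit for _, hit in chunk]
        points.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return points


# Hypothetical grouped data: twelve responses, two per part.
conf = [50, 50, 60, 60, 70, 70, 80, 80, 90, 90, 100, 100]
hits = [1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
points = calibration_points(conf, hits)
print(points)
```

Points below the diagonal (proportion correct lower than mean probability) indicate overconfidence, and points above it indicate underconfidence.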

The figures reveal the nature of subjects’ miscalibration, with all of the curves showing patterns that

are similar to those found generally in studies of probability assessments by groups of individuals (see

Refs. [28,29] for reviews). Specifically, there is slight underconfidence displayed for responses at or

around 0.5 (50%) with increasing overconfidence for higher (confidence) responses. Comparisons

between first- and second-round curves will be considered shortly.

As introduced earlier, a general numerical measure of the appropriateness of confidence is the Brier,

or mean probability, score. The formula for the Brier measure is:

\[ PS = \frac{1}{N} \sum (d - f)^2 \]

where N = total number of items, d = outcome index (e.g., 0 = will not happen; 1 = will happen), and

f = probability estimate from 0 to 1 (0 = definitely will not happen; 1 = definitely will happen).


Fig. 2. Calibration curves for the two rounds of the Iteration condition.


The Brier PS is a measure of the correspondence between the accuracy of forecasts and confidence in

those forecasts as expressed by subjective probability. Perfect correspondence gives a score of 0 and

counterperfect correspondence a score of 1. Note that, in practice, the score should not exceed 0.25 as

such can be achieved by someone who responds with a subjective probability of 0.5 (50%) for every

event [29]. The mean Brier score across subjects was 0.228 (S.D. = 0.036), which, although significantly

different from ‘chance’ (two-tailed t = 5.65, P < 0.01), only differed from the chance baseline by a small

amount. This suggests that, although subjects’ confidence ratings may be appropriate to a degree, the

extent of this correspondence may be too small to allow first-round confidence to be used as an effective

indicator of objective expertise/accuracy.
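The Brier score and its reference values can be illustrated with a minimal sketch. The function name and the outcome data are hypothetical; forecasts are assumed to be already expressed on the full 0-to-1 scale (probability that the event will happen).

```python
def brier_score(forecasts, outcomes):
    """Mean probability score: mean of (d - f)^2 over all items.
    forecasts: f, probability from 0 to 1 that each event will happen.
    outcomes: d, 1 if the event happened, 0 if it did not."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((d - f) ** 2 for f, d in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical outcomes for four events.
outcomes = [1, 0, 0, 1]

perfect = brier_score([1, 0, 0, 1], outcomes)          # always right and certain
indifferent = brier_score([0.5] * 4, outcomes)         # 50% for every event
counterperfect = brier_score([0, 1, 1, 0], outcomes)   # always wrong and certain
print(perfect, indifferent, counterperfect)
```

Responding with 0.5 throughout yields 0.25 regardless of the outcomes, which is why 0.25 serves as the ‘chance’ baseline against which the observed mean of 0.228 is compared.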

6.5. Confidence and propensity to change

Mean first-round confidence was correlated with subjects’ mean proportion of prediction changes

over rounds. Against Hypothesis 3B, high initial mean confidence was not related to low propensity to

change predictions, when correlated within all subjects (r=+0.081, P >0.05, df = 84) or when

correlated within subjects according to feedback condition (r = −0.222, +0.146, and +0.255 for

Reasons, Statistical, and Iteration, respectively, all P >0.05). However, the obtained correlation

between the change measure and Brier scores did reveal a strong overall relationship (r=+0.289,

P< 0.01, df= 84), suggesting, initially, that subjects who gave confidence values that were more

appropriate changed their predictions the least. Taking into account condition, however, reveals that

this relationship is conditional upon the nature of feedback, reaching significance only in the Iteration

condition (r=+0.536, P< 0.01, df= 26) but not in either the Statistical (r=+0.196, P >0.05, df = 28) or

Reasons conditions (r=+0.305, P>0.05, df = 26).

The various analyses above consider the relationship between first-round confidence estimates, the

appropriateness of these, and a variety of measures. It is also of interest to see how the pattern of


Fig. 3. Calibration curves for the two rounds of the Statistical condition.


confidence appropriateness changes over rounds and how this is associated with the iteration of

estimates and the presence of feedback from others. Figs. 2–4, which report calibration curves based on

grouped data for the three conditions, indicate that calibration improves over rounds in each case, with

the second-round curves being generally closer to the diagonal ideal than those for the first round. This

improvement is clearest for the Statistical condition, where overconfidence is much reduced at the

Fig. 4. Calibration curves for the two rounds of the Reasons condition.

Table 3

Mean Brier scores in the three conditions over rounds

Condition                     First round   Second round   t      P
Iteration (n = 28)     Mean   0.23          0.23           0.85   0.40
                       S.D.   0.04          0.03
Statistical (n = 30)   Mean   0.23          0.22           5.92   < 0.0001
                       S.D.   0.03          0.03
Reasons (n = 28)       Mean   0.22          0.21           3.63   0.0012
                       S.D.   0.03          0.03


second round (although more underconfidence is displayed on the lower part of the curve). Improvement

over rounds is also observable in the other two conditions—where there are slight reductions in both

under- and overconfidence.

To further consider whether the graphically suggested improvements in confidence appropriateness

are genuine, mean Brier scores were derived for subjects under each condition in each round and were

compared statistically. The results are reported in Table 3. Although mean Brier score differences across

rounds were small in magnitude, they proved significant at the P < 0.01 level in both the Statistical and

Reasons conditions but not in the Iteration condition.

In essence, these results suggest that not only does accuracy tend to improve across rounds in Delphi-

like conditions (cf. Table 1), but subjects’ confidence assessments also become more appropriate. In

terms of using first-round confidence measures to attribute expertise, however, the results generally

mirror those of Rowe and Wright [9] and suggest that this is not feasible.

6.6. Majority–minority influences

The level of agreement or disagreement of the members of the nominal groups was measured for

each incident of cross-round prediction change to the converse alternative. Initially, because group size

may be an important determinant of change, this measure was derived separately for prediction

changes of subjects in the groups of five and four. In the former case, each change incident was

classified according to the number of ‘supporters’ of each subject’s initial categorical prediction,

classified as zero, one, two, three, or four; while in the latter case, the number of supporters was

limited to either zero, one, two, or three. The observed frequency of prediction change under each

category of ‘support’ was compared to the expected frequency of change under each category, taking

into account the total number of cases in each support category (across all subjects and items and

irrespective of whether change did or did not occur on the second round). The nature of this

relationship is clarified in Table 4 (groups of five) and Table 5 (groups of four), which tabulate the

proportions of changes/no changes under each degree of support. The information in Table 4 concerns

change versus degree of support aggregated across all groups of five and also across the groups of five

decomposed according to feedback condition; while Table 5 presents information regarding the

Reasons condition only, inasmuch as the two groups of four (ignoring the Iteration condition) both

occurred in this condition.

Chi-square analysis between the variables ‘change’ (yes/no) and ‘degree of support’ (0/1/2/3 and,

when appropriate, 4) gave highly significant results for the groups of five (χ² = 681.23, df = 4,

P < 0.01) and the groups of four (χ² = 72.36, df = 3, P < 0.01). Specifically, and in line with

Table 4

Proportion of changes/no changes under different first-round degrees of support (groups of five)

Condition     Change   Degree of support (groups of 5)
                       0       1       2       3       4
All           Yes      0.369   0.413   0.119   0.067   0.032
              No       0.035   0.112   0.214   0.272   0.366
Reasons       Yes      0.333   0.441   0.157   0.029   0.039
              No       0.040   0.108   0.209   0.281   0.361
Statistical   Yes      0.393   0.393   0.093   0.093   0.027
              No       0.032   0.115   0.217   0.266   0.370


Hypothesis 4A, the greater the opposition of group members to subjects’ first-round

predictions (i.e., the lesser the support), the greater the subjects’ propensity to change judgments on the

second round. This finding held for the five-member groups under both feedback conditions

(χ² = 252.10 and 438.84, for Reasons and Statistical, respectively, both df = 4, P < 0.01). Analysis of

the relationship between the feedback conditions (in terms of prediction change) and degree of

support suggests that the variables ‘condition’ and ‘support’ are independent (χ² = 6.97, df = 4,

P>0.05) and that the feedback types hence did not differ significantly in the way in which they

effected or mediated the impact of majorities in the nominal groups. There also does not appear to

be any substantial difference in the pattern of results between data from the groups of five and four

(see Tables 4 and 5).
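The comparison of observed and expected frequencies underlying the chi-square analyses can be sketched as follows. Because only proportions (not raw counts) are reported in the tables, the counts used here are hypothetical, as is the function name; the sketch shows the form of the calculation rather than a reproduction of the reported statistics.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts.
    table: list of rows, e.g. [change = yes counts, change = no counts],
    with one column per degree of support.
    Note: every row and column total must be greater than zero."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: rows = change (yes, no), columns = 0..4 supporters.
hypothetical = [
    [40, 45, 15, 8, 4],
    [50, 160, 300, 380, 510],
]
stat, df = chi_square(hypothetical)
print(round(stat, 2), df)

# Two small checks with known answers.
dependent = chi_square([[10, 20], [20, 10]])
independent = chi_square([[10, 20], [20, 40]])
```

With two rows and five columns, the degrees of freedom equal 4, matching the df reported for the groups of five.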

The above analysis shows that majorities exert strong influences on minorities in terms of evoking

categorical prediction changes over rounds. However, the analysis reveals only a general relationship

and does not account for the fact that individuals in minorities do not always change to the consensus

position nor does it account for the fact that change sometimes occurs against the majority viewpoint

(e.g., see Table 4). The next analysis considers a possible explanation in terms of objective first-round

accuracy—an explanation that is implicit in the Theory of Errors. More specifically, the next analysis

focuses upon those test items on round one where the majority of subjects in a nominal group (i.e.,

three or more) were correct, compared to where the majority were incorrect (the ratio of which was

approximately 3:1). Recall that in the Reasons condition, there were four groups of five and two

groups of four, allowing the possibility of draws in the latter case. Such draws were not included in

the analysis, necessitating the removal of 8 and 15 items out of 60 (and hence of 32 and 60

individual predictions) in the two groups of four, respectively. Table 6 shows the proportion correct

under each feedback condition, under each round, according to the majority accuracy factor, and

summarises the analysis of the changes in accuracy (using one-tailed related t tests) in each cell of the

two-by-two matrix.
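The classification of items by first-round majority accuracy, including the exclusion of draws in four-member groups, can be sketched as follows. The function name and the example data are hypothetical.

```python
def classify_majority(first_round_choices, outcome):
    """Classify one item for one nominal group by first-round accuracy.
    first_round_choices: categorical predictions of all group members.
    outcome: the option that turned out to be correct.
    Returns 'right', 'wrong', or 'draw' (possible only in even-sized groups)."""
    n_correct = sum(1 for choice in first_round_choices if choice == outcome)
    n_incorrect = len(first_round_choices) - n_correct
    if n_correct > n_incorrect:
        return "right"
    if n_incorrect > n_correct:
        return "wrong"
    return "draw"


# Hypothetical items: a five-member group and a four-member group.
five = classify_majority(["a", "a", "a", "b", "b"], outcome="a")
four_draw = classify_majority(["a", "a", "b", "b"], outcome="a")
four_wrong = classify_majority(["b", "b", "b", "a"], outcome="a")

# Draws are dropped before proportion correct is compared across rounds.
kept = [label for label in (five, four_draw, four_wrong) if label != "draw"]
print(five, four_draw, four_wrong, kept)
```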

Table 5

Proportion of changes/no changes under different first-round degrees of support (groups of four)

Condition   Change   Degree of support (groups of 4)
                     0       1       2       3
Reasons     Yes      0.426   0.383   0.128   0.064
            No       0.081   0.176   0.363   0.381

Table 6

Proportion correct in the two feedback conditions over rounds, shown as a function of whether the majority of the nominal

groups were correct or incorrect at the first round

                                        Majority right (first round)              Majority wrong (first round)
                                        First    Second                           First    Second
                                        round    round    t      P                round    round    t       P
Statistical proportion correct   Mean   0.82     0.87     4.46   0.0001           0.26     0.22     −3.10   0.0043
                                 S.D.   0.06     0.06                             0.11     0.13
Reasons proportion correct       Mean   0.84     0.88     1.86   0.074            0.25     0.20     −1.59   0.12
                                 S.D.   0.08     0.07                             0.12     0.12


As can be seen from Table 6, for items where the majority were right at the first round, subsequent

accuracy improved on the second round, a trend that was found to be significant in the Statistical

condition but not in the Reasons condition. By contrast, for items where the majority were wrong at the

first round, subsequent accuracy declined on the second round—and again, this relationship proved

significant for the Statistical condition but not for the Reasons condition. Analysis using two-way

ANOVAs of the differences in the degree of change of proportion-correct accuracy over rounds between

the feedback conditions did not, however, reveal any significant differences, either in the case of

majority right items [F(1,56) = 0.50, P=0.481] or majority wrong items [F(1,56) = 0.14, P=0.713]. In

terms of the hypotheses, these results do not support Hypothesis 4B, in that the influence of the majority

appears to dominate in both conditions, even to the extent of drawing initially accurate individuals

towards the majority’s erroneous position.

7. Discussion

This study has aimed to reproduce and extend the findings of Rowe and Wright [9] and, in doing so,

has attempted to maintain task characteristics across studies (as much as possible) to increase the chance

of generalising results between them. In terms of reproducing the earlier findings, however, the success

of the present study is limited and lends further weight to the general thesis of Rowe and Wright [9]—

that subtle changes in technique composition, group membership, and other situational factors are liable

to have substantial effects on how and whether Delphi will aid judgmental accuracy. Indeed, the present

study appears to demonstrate just how sensitive results are to any manipulation of task environment. For

example, in spite of the fact that both the present study and the previous one involved short-term

forecasting tasks using groups of five (student) subjects who made predictions about newsworthy

political, economic, and international events, the Iteration condition in the present study led to no

improvement in forecasting across rounds when previously it did. Although the two feedback conditions

in the present study are undoubtedly somewhat dissimilar to those conditions in the earlier study, the

same cannot be said with respect to the Iteration condition, in which precisely the same instructions and

subject requirements were used.

The most probable explanation for the above outcome is that there were subtle differences between

the two tasks and sets of performance measures that were not controlled for across studies. One


difference that might be of importance concerns the nature of elicited subject responses: the previous

study required quantitative responses (e.g., the number of seats to be won by a named political party in a

forthcoming election), while the present study required qualitative ones (i.e., a choice of one answer

from two possibilities). For example, it may be that subjects are more amenable to change and

improvement when they are able to alter an estimate that they might recognise, on reconsideration, to be

exaggerated but may feel a certain defensiveness about admitting a definite, categorical, and undisputed

error—something that would be implied in making a prediction change to the converse of two exclusive

and exhaustive options. Another difference might lie in the relative difficulty of the items being forecast

in the two studies. The essential nature of the cross-study task differences here would seem a topic

worthy of further investigation.

Returning to the results of the present study, evidence was found for an improvement in cross-round

accuracy in the two feedback conditions—a trend that was significant in the Statistical case. With respect

to the propensity of subjects to change predictions over rounds, there were no significant differences

between the three experimental conditions, although (as in the previous study) the Iteration condition led

to a higher mean proportion of changes than either of the feedback conditions. To the extent that results

from the two studies are generalisable, this trend indicates that feedback may actually serve as an

inhibitor of change, perhaps by inducing a defensiveness in panellists. Our feedback formats were

deliberately simplified so as not to confound the effects of Statistical and Reasons feedback, but the

generally prescribed Delphi method uses both types, and it would be interesting to assess in a future

study how panellists responded to both types in one process (e.g., whether one feedback type proved

more influential than another).

In terms of the appropriateness of changes over rounds, however, the present results bear little

similarity to those of Rowe and Wright [9]. In the previous study, an association was found between

increasing propensity of subjects to change judgments and increasing accuracy improvement, in both

feedback conditions (but not in the Iteration condition), contrasting with the present results, where a

similar association was revealed in the Statistical and Iteration conditions but not in the Reasons

condition. Rowe and Wright proposed that feedback allows good forecasters to identify themselves,

while providing information to direct the changes of the less expert—an explanation that might still

account for the outcomes of the Statistical condition but does not explain the lack of influence of the

Reasons feedback or the positive association found in the Iteration condition between these measures.

Explanation of these discrepancies appears to require recourse once more to the task characteristics of

the present study. The general ineffectuality of the Reasons feedback across a number of performance

measures may derive from the sheer number of reasons that our subjects were required to generate.

Computation of the Brier measure requires a large number of probability judgments from subjects. This

may have led to a degree of overload on the subjects, with a consequent decrease in the quality of

arguments and their subsequent ability to appropriately influence opinion change in other panellists.

Indeed, it is worth noting that posttask examination of subjects’ written rationales revealed a large

number of reasons that simply reported opinions (‘I think that this is more likely. . .’) rather than causal

arguments (‘I think that this is more likely because. . .’). Information of the former type is arguably less

rich than that of the latter type and arguably less useful for subjects in terms of discriminating relative

expertise.

With respect to the role of ‘objective expertise’ in nominal groups, success at replicating the results of

Rowe and Wright’s study was once more variable. As in the previous study, it was found that high

relative expertise was associated with a low propensity to change predictions over rounds, an

association that was significant in the Statistical condition and a nonsignificant trend in the Reasons

condition. A similar trend was also found in the Iteration condition, although the general effectiveness of

the Iteration approach proved inferior here to that of the feedback approaches in terms of overall

improvement in aggregate accuracy across rounds.

In Rowe and Wright’s study, mean first-round confidence, as obtained through rating scales, did not

appear to be an especially good predictor of objective expertise. In the present study, an attempt was

made to replicate and extend consideration of the role of confidence in nominal groups, particularly by

using more fine-grained measures of the appropriateness of confidence (i.e., Brier scores). As in the

previous study, little evidence was found of any relationship between subjects’ mean first-round

confidence and either their first-round accuracy or their propensity to change predictions over rounds.

This result argues against the use of confidence measures as discriminants of expertise (for panellist

selection). Consideration of calibration graphs revealed the nature of the miscalibration, with subjects

generally exhibiting overconfidence—a bias that has been frequently reported (e.g., [28,29]).

Providing feedback (either Reasons or Statistical) encouraged more appropriate ratings (i.e., better

Brier scores).
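
As an illustration of the two quantities discussed above, the following minimal Python sketch computes a mean probability (Brier) score and a simple overconfidence index for a two-alternative task. It is not the authors' analysis code: the half-Brier form (range 0 to 1) is an assumption, since the exact scoring variant is not restated in this section, and the function names and data values are hypothetical.

```python
from statistics import mean


def brier_score(confidences, correct):
    """Mean squared gap between stated probability and outcome (0 is perfect).

    confidences: probability assigned to the chosen answer (0.5 to 1.0).
    correct: True if the chosen answer turned out to be right.
    Assumption: half-Brier form, one term per forecast.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    return mean((p - (1.0 if c else 0.0)) ** 2
                for p, c in zip(confidences, correct))


def overconfidence(confidences, correct):
    """Mean confidence minus proportion correct; positive means overconfident."""
    return mean(confidences) - mean(1.0 if c else 0.0 for c in correct)


# Hypothetical first-round forecasts from one panellist (illustrative only).
round1_conf = [0.9, 0.8, 0.7, 1.0]
round1_hit = [True, False, True, False]

print(brier_score(round1_conf, round1_hit))
print(overconfidence(round1_conf, round1_hit))
```

A lower Brier score across rounds would correspond to the more appropriate ratings reported for the feedback conditions, while a positive overconfidence value corresponds to the miscalibration seen in the calibration graphs.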

The final factor that was considered in the present study was the influence of majorities and

minorities. Unsurprisingly, it was found that the propensity of subjects to change their predictions over

rounds was significantly related to the degree of support or opposition (in the nominal group) for the

subjects’ initial predictions. The expectation, however, that accurate minorities (excluding overt social

group pressure) could pull inaccurate majorities towards the correct position was not realised; majorities,

whether accurate or otherwise, exerted a significant pull on minorities to the consensual position, even

when that position was fallacious.
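
The relationship between support and opinion change can be tabulated along the following lines. This is a hedged sketch rather than the procedure used in the study: the function name, the data layout (one list of first-round and second-round choices per item), and the example values are all hypothetical.

```python
from collections import defaultdict


def change_rate_by_support(items):
    """Proportion of predictions changed, grouped by first-round support.

    items: list of (round1, round2) pairs, each a list of choices with one
    entry per panellist in the nominal group. "Support" is the number of
    OTHER panellists who made the same first-round choice.
    """
    changed = defaultdict(int)
    total = defaultdict(int)
    for round1, round2 in items:
        for i, choice in enumerate(round1):
            support = sum(1 for j, other in enumerate(round1)
                          if j != i and other == choice)
            total[support] += 1
            if round2[i] != choice:
                changed[support] += 1
    return {s: changed[s] / total[s] for s in sorted(total)}


# Two hypothetical items answered by a nominal group of five.
example_items = [
    (["A", "A", "A", "A", "B"], ["A", "A", "A", "A", "A"]),
    (["A", "A", "B", "B", "B"], ["A", "B", "B", "B", "B"]),
]

print(change_rate_by_support(example_items))
```

In this invented example, the lone dissenter (support of zero) always changes while panellists in the majority never do, which is the pattern of majority pull described above. Whether the majority position was correct would need to be recorded separately to distinguish appropriate from inappropriate change.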

The pervasive effect of majorities is, however, likely to be influenced by the degree of expertise

possessed by subjects; although subjects showed a fair degree of competence in this particular

forecasting task (with a hit rate of approximately two-thirds correct), the study of the behaviour of

more-expert subjects might conceivably yield different results, with subjects of a higher base level of

expertise perhaps being more able to resist majority influences in appropriate situations. This is

particularly important, inasmuch as it is not Delphi’s intent to force consensus. Also of interest with

regard to majority/minority influence are the effects of different sized groups and hence the potentially

different magnitudes of opposition and support. The impacts of relative expertise and group size on

opinion change and judgment accuracy are areas that ought to be considered in future studies.

For practitioners, the implications of the results of this study are several-fold. First, evidence suggests

that confidence is not a good predictor of expertise and hence should not be used as a selection device

(e.g., to choose among a list of experts). Second, and perhaps obviously, practitioners should take care in

choosing their experts, inasmuch as evidence suggests that the ‘better’ the expert is, the more

appropriately they are likely to respond to feedback (see also Ref. [13]). Third, they should be aware

that the power of the majority is not totally undermined in Delphi, and hence, convergence of opinion

over Delphi rounds will not necessarily imply improved forecasting accuracy in every case. And fourth,

the water-muddying results of this study suggest that the practitioner should think carefully about the

nature of feedback they provide, what information it might contain, and how their panellists might react

to it, inasmuch as a variety of studies (including this one) give contrary results as to whether iteration

alone, the feedback of statistical averages, or the use of panellist arguments will lead to the most improved

performance over rounds. In future studies, we hope to disentangle the complex interactions between

feedback type, expertise, panellist personality, and accuracy measures.





References

[1] N. Dalkey, O. Helmer, An experimental application of the Delphi method to the use of experts, Manage. Sci. 9 (1963)

458–474.

[2] H.A. Linstone, M. Turoff, The Delphi Method: Techniques and Applications, Addison-Wesley, London, 1975.

[3] G. Rowe, G. Wright, Expert opinions in forecasting: role of the Delphi technique, in: J.S. Armstrong (Ed.),

Principles of Forecasting: A Handbook for Researchers and Practitioners, Kluwer Academic Publishing, Norwell, MA,

2001, pp. 125–144.

[4] G. Rowe, G. Wright, The Delphi technique as a forecasting tool: issues and analysis, Int. J. Forecast. 15 (1999) 353–375.

[5] G. Rowe, G. Wright, F. Bolger, The Delphi technique: a reevaluation of research and theory, Technol. Forecast. Soc.

Change 39 (3) (1991) 235–251.

[6] M. Scheibe, M. Skutsch, J. Schofer, Experiments in Delphi methodology, in: H.A. Linstone, M. Turoff (Eds.), The Delphi

Method: Techniques and Applications, Addison-Wesley, Reading, MA, 1975, pp. 262–287.

[7] F.J. Parente, J.K. Anderson, P. Myers, T. O’Brien, An examination of factors contributing to Delphi accuracy, J. Forecast.

3 (2) (1984) 173–182.

[8] D.M. Boje, J.K. Murnighan, Group confidence pressures in iterative decisions, Manage. Sci. 28 (10) (1982) 1187–1196.

[9] G. Rowe, G. Wright, The impact of task characteristics on the performance of structured group forecasting techniques, Int.

J. Forecast. 12 (1996) 73–90.

[10] R.J. Best, An experiment in Delphi estimation in marketing decision-making, J. Mark. Res. 11 (1974) 448–452.

[11] F.J. Parente, J.K. Anderson-Parente, Delphi inquiry systems, in: G. Wright, P. Ayton (Eds.), Judgmental Forecasting,

Wiley, Chichester, 1987, pp. 129–156.

[12] B.L. Bonner, M.R. Baumann, R.S. Dalal, The effects of member expertise on group decision-making and performance,

Org. Behav. Hum. Decis. Process. 88 (2) (2002) 719–736.

[13] M.J. Prietula, P.J. Feltovich, F. Marchak, Factors influencing analysis of complex cognitive tasks: a framework and

example from industrial process control, Hum. Factors 42 (1) (2000) 56–74.

[14] M. Sherif, A study of some social factors in perception, Arch. Psychol. 27 (187) (1935) 1–60.

[15] S. Asch, Effects of group pressure upon the modification and distortion of judgments, in: H. Guetzkow (Ed.), Groups,

Leadership and Men, Carnegie Press, Pittsburgh, PA, 1951.

[16] M. Doms, E. van Avermaet, Majority influence, minority influence and conversion effect: a replication, J. Exp. Soc.

Psychol. 16 (1980) 283–292.

[17] S. Moscovici, E. Lage, M. Naffrechoux, Influence of a consistent minority on the responses of a majority in a color

perception task, Sociometry 32 (1969) 365–380.

[18] A. Maass, R.D. Clark, Hidden impact of minorities: fifteen years of minority influence research, Psychol. Bull. 95 (1984)

428–450.

[19] S. Keeney, F. Hasson, H.P. McKenna, A critical review of the Delphi technique as a research methodology for nursing, Int.

J. Nurs. Stud. 38 (2001) 195–200.

[20] J.A. Sniezek, An examination of group process in judgemental forecasting, Int. J. Forecast. 5 (1989) 171–178.

[21] J.A. Sniezek, Groups under uncertainty: an examination of confidence in group decision making, Org. Behav. Hum.

Decis. Process. 52 (1) (1992) 124–155.

[22] J.S. Armstrong, Long Range Forecasting: From Crystal Ball to Computer, 2nd ed., Wiley, New York, 1985.

[23] A.H. Murphy, A new vector partition of the probability score, J. Appl. Meteorol. 12 (1973) 595–600.

[24] S. Lichtenstein, B. Fischhoff, L.D. Phillips, Calibration of probabilities: the state of the art to 1980, in: D. Kahneman, P.

Slovic, A. Tversky (Eds.), Judgement Under Uncertainty: Heuristics and Biases, Cambridge Univ. Press, Cambridge,

1982, pp. 306–334.

[25] G.W. Brier, Verification of forecasts expressed in terms of probability, Mon. Weather Rev. 78 (1950) 1–3.

[26] J.F. Yates, Judgment and Decision Making, Prentice-Hall, Englewood Cliffs, NJ, 1990.

[27] P. Ayton, G. Wright, Assessing and improving judgmental probability forecasts, Omega 15 (1987) 191–196.

[28] A.G.R. McClelland, F. Bolger, The calibration of subjective probabilities: theories and models 1980–1994, in: G. Wright,

P. Ayton (Eds.), Subjective Probability, Wiley, Chichester, 1994, pp. 453–481.

[29] J.F. Yates, Subjective probability accuracy analysis, in: G. Wright, P. Ayton (Eds.), Subjective Probability, Wiley,

Chichester, 1994.


Gene Rowe is currently a senior scientist in the Consumer Science Group at the Institute of Food Research, Norwich (UK). His

PhD, gained from the Bristol Business School at the University of the West of England (UWE), concerned the use of nominal

groups to improve human judgment and decision making. Apart from a continuing interest in judgment and decision making,

his research activities and publications have also spanned topics from expert systems and forecasting to risk perception and

public participation. Much of his recent work has focused on the issue of evaluating the effectiveness of public participation

exercises.

George Wright is a professor and the Head of the Department of Management at the Durham Business School. He is an

associate editor of both the International Journal of Forecasting and the Journal of Forecasting. His work has been published

in journals such as Management Science and Strategic Management Journal. His interest is in the role and validity of judgment

in forecasting.

Andy McColl is a doctoral student at the Durham Business School. His research interest is in decision framing and the

behavioral and cognitive processes underlying strategic decision making.
