Munich Personal RePEc Archive

Anchoring in Project Duration Estimation

Lorko, Matej; Servátka, Maroš; Zhang, Le

MGSM Experimental Economics Laboratory, Macquarie Graduate School of Management; Ekonomická Univerzita v Bratislave

13 August 2018

Online at https://mpra.ub.uni-muenchen.de/88456/
MPRA Paper No. 88456, posted 20 Aug 2018 10:08 UTC
2008), social judgments (Davis, Hoch, & Ragsdale, 1986), self-efficacy (Cervone & Peake, 1986), and
meta-memory monitoring (Yang, Sun, & Shanks, 2017).
The influence of anchors has also been documented in domains relevant to our research questions, namely task duration estimation and effort estimation.1 For example, estimates of
effort required for software development can be anchored on customers’ (Jørgensen & Sjøberg, 2004)
or managers’ expectations (Aranda & Easterbrook, 2005). An anchoring effect can also be introduced by varying the wording rather than presenting numerical values, as demonstrated by Jørgensen & Grimstad (2008), who find that work-hour estimates of the same task differ when it is labelled “minor extension”, “extension”, or “new functionality” across treatment groups.2 Moreover, Jørgensen & Grimstad
(2011) observe an anchoring effect in estimates provided by outsourcing companies in a field setting
of software development. In the domain of task duration estimation, König (2005) demonstrates the
anchoring effect on estimates of time needed to find answers to questions in a commercial catalogue.
Before the estimation and actual task completion, subjects are asked to consider whether they need
more or less than 30 (a low anchor) or 90 (a high anchor) minutes to complete the task. Consistent
1 In project management, the duration of the task is often reported in man-hours (man-days) and is referred to as the effort
estimate.
2 This manipulation can be considered framing rather than anchoring. However, software development companies relatively frequently use terms such as “extension” or “new functionality” to describe workloads requiring a specific number of work-hours. For example, a past employer of one of the authors uses “Minor Enhancement” as a category for any new piece of work that requires approximately 160 work-hours to complete. Thus, the expression is strongly associated with a particular number and serves as a powerful anchor for effort estimation.
with the hypothesis, the estimates in the low anchor treatment are significantly lower than those in the high anchor treatment. The actual time in which subjects complete the task is also measured; however, no significant differences across treatments are found. The author concludes that
“estimating the amount of time needed to complete a task is a fragile process that can be influenced
by external anchors” (p. 255). Similar results are presented by Thomas & Handley (2008), who also
find that significant differences in duration estimates can be caused even by anchors irrelevant to the
estimating problem at hand.
Altogether, there exists considerable scientific evidence of anchoring in task duration and effort
estimation from laboratory experiments, field experiments, and field questionnaires. However, none
of the previous laboratory and classroom experiments incentivized subjects for their estimation
accuracy (only flat fees or course credits were used). The lack of real incentives can cause a
hypothetical bias (e.g., Hertwig & Ortmann, 2001), and it is therefore questionable whether the
anchoring effect is robust when misestimation can cause real losses to the estimator. Indeed, the
relatively low magnitude of the anchoring effect found in the above-mentioned field experiment by
Jørgensen & Grimstad (2011) can possibly be attributed to the fact that companies producing the
estimates were informed that they might be offered additional opportunities for estimation work if
their estimates were accurate. In addition, anchoring studies usually employ one-shot tasks, making
it impossible to study whether subjects learn from their previous estimation errors caused by
anchors.3 To the best of our knowledge, the influence of anchors in duration estimation has never
been tested in more than one period and we therefore have no knowledge about how the anchoring
effect interacts with the planner’s experience. In fact, despite the extensive body of anchoring-related
research, relatively little is known about the long-term effects of anchors in general. Ariely et al. (2003) find differences in willingness to accept payment for listening to annoying sounds between treatments in which subjects are initially anchored on different amounts of money. The anchored willingness to accept does not converge in repeated estimation, not even in a bidding contest. On
the other hand, Alevy et al. (2015) find convergence of prices by the third period in a market setting in which subjects trade baseball cards. However, since there is no “true” value of the price for
3 See Smith (1991) for a discussion of the importance of interactive experience in social and economic institutions in
testing rationality (and implicitly the lack of biases in decision-making). While our experiment does not allow for an
interaction (due to the nature of decision-making and the implemented lack of feedback) and is institution-free, it does take
a step in the direction proposed by Smith, namely by testing whether the experience itself is sufficient to eliminate the
anchoring bias.
listening to annoying sounds or of baseball cards, we lack sufficient evidence from these studies to demonstrate the anchoring bias.
Furthermore, in the domain of duration estimation, many of the earlier studies employed relatively
unfamiliar tasks, possibly exacerbating the bias. As suggested by Smith (1991), one might expect that
prior task experience will reduce the influence of nuisance factors, such as anchors, in economic
decision-making. This claim is supported by empirical evidence directly related to anchoring. For
example, Thomas & Handley (2008) show that subjects who reported having performed a similar task in the past are less affected by anchors in the experimental setting. Similarly, Løhre &
Jørgensen (2016) find that subjects with a longer tenure in the profession are less influenced by
anchors and thus provide more accurate estimates than less experienced subjects. However, the
anchoring effect is still significant even for the most experienced subjects.
While the persistence of the anchoring effect over time and its correlation with subjects’ task
experience have not been tested, there exists a strand of literature on the effect of experience on the
accuracy of non-anchored duration estimates. However, the results are mixed. On the one hand, more
experience leads to more accurate estimates in tasks such as reading (Josephs & Hahn, 1995),
software development effort (Morgenshtern, Raz, & Dvir, 2007), and running (Tobin & Grondin, 2015).
On the other hand, experienced users tend to underestimate the time needed to complete other
tasks, such as playing piano pieces (Boltz, Kupperman, & Dunne, 1998), using cell phones and
assembling LEGOs (Hinds, 1999) and making origami (Roy & Christenfeld, 2007). Thus, the
effect of experience on estimation accuracy is likely to be task- or context-specific. Possibly, the mixed results can be explained by the fact that focusing on the task duration is more important and salient for some tasks (such as programming or running) than for others. In a similar vein,
Halkjelsvik & Jørgensen (2012) argue that having experience with the task itself does not necessarily
imply having experience with its duration estimation. When people do not receive feedback regarding
their estimation accuracy (or do not usually estimate the duration of the task in the first place), the
increase in their experience with the task can lead to more optimistic and hence less accurate
estimates. This proposition is supported by experimental results demonstrating that just prompting
for self-generated feedback on the estimation accuracy can reduce future estimation errors (König,
Wirz, Thomas, & Weidmann, 2014). Anchors might also affect individual estimation consistency. For
example, ceteris paribus, the duration estimate by the same person for the same task should be
approximately the same. However, Grimstad & Jørgensen (2007) find relatively large variance
between the estimates of the same tasks provided by the same experienced software professionals.
Can such inconsistency be explained by anchoring effects? Re-examining the experimental data, Halkjelsvik & Jørgensen (2012, p. 241) note that “the high level of inconsistency is to a certain extent a product of assimilation toward the preceding tasks or judgments.”
Overall, the anchoring effect is found to be a pervasive phenomenon in the domain of task duration
estimation. However, it is not clear whether anchors persist over time and whether these effects
depend on planners’ experience. We design an incentivized experiment to fill these gaps and to
thoroughly examine the prevalence as well as limitations of the influence of anchors. In companies,
estimates are usually produced by experienced professionals who are familiar with the task at hand, and the estimation is often repeated. We therefore incorporate experience and repetition in
our experimental design, together with meaningful incentives for task performance and estimation
accuracy. Thus, our study presents a conservative test designed to detect the lower bound of the
anchoring effect. One can imagine that if we observe an anchoring effect in our setup, it would be
even more prevalent in environments characterized by the absence of these features.
3. Experimental design
We conduct an incentivized laboratory experiment employing an individual real-effort task to test
whether numerical anchors influence duration estimates and whether such effects persist over time.
Throughout the experiment, subjects are prohibited from using their watches, mobile phones, and any other devices with time-displaying functions. The laboratory premises contain no other time-displaying devices, and the clocks on the computer screens are hidden.
The experiment consists of three rounds. In every round, each subject is requested to estimate how
long it will take him to complete the upcoming task before the actual task performance starts. In our
task, an inequality between two numbers ranging from 10 to 99 is displayed on the computer screen
(for sample screenshots, see the Instructions in the appendix) and the subject is asked to answer
whether the presented inequality is true or false. Immediately after the answer is submitted, a new,
randomly chosen inequality appears. The task finishes once the subject provides 400 correct answers.
The advantages of this task are its familiarity (people often compare numbers in everyday life, for example, prices before a purchase) and the fact that each inequality has only one correct answer (out of two options), making the estimation process simple. The target number of correct answers (400) was calibrated in a pilot with the goal of reaching an average task duration of 600–750 seconds (10–12.5 minutes), as previous research by Roy, Christenfeld, & McKenzie (2005) suggests that tasks exceeding 12.5 minutes are usually underestimated, whereas shorter tasks are usually overestimated. All in all, the design creates a favorable environment for subjects to estimate the duration accurately.
Subjects perform similar two-digit number comparisons in each round. To test whether people are
able to overcome the anchoring effect by learning from the experience itself, we provide no feedback
regarding the actual duration or estimation accuracy between rounds. Such a design captures a common problem of project management in many companies, namely that project planners do not receive detailed feedback on the actual hours spent by project team members on each task. Even if the actual durations of project activities are evaluated against the project plan, schedule delays are often attributed to factors other than inaccurate estimation. Both missing and inadequate feedback make a project planner unlikely to improve his duration estimates.
In the experiment, subjects are financially incentivized for both their estimation accuracy and task
performance. The incentive structure is designed to motivate subjects to estimate the task duration
accurately, but at the same time to work quickly and avoid mistakes. While the main objective of the
experiment is to test the estimation accuracy, our research question requires incentivizing both
accuracy and performance. Without incentivizing the task performance, subjects could deliberately
provide high estimates and then adjust their pace in order to maximize their accuracy earnings.
Providing incentives for performance creates an environment analogous to duration estimation in
project management where the goal is not only to produce an accurate project schedule, but also to
deliver project outcomes as soon as possible (holding all other attributes constant). Since there are
two dimensions of incentives, there is a concern that subjects might try to create a portfolio of
accuracy and performance earnings. While one can control for the portfolio effect by randomly
selecting one task for payment (Cox, Sadiraj, & Schmidt, 2015; Holt, 1986), we choose to incentivize
subjects for both tasks and minimize the chances of subjects constructing a portfolio by a careful
experimental design and selection of procedures. First, subjects are not able to track time throughout
the entire experiment. Second, our software is programmed so as to provide neither the count of
correct answers nor the total number of attempts. Both design features make it unlikely for subjects to
strategically control their pace and match it with their estimates.4 We use a linear scoring rule to incentivize both estimation accuracy and task performance. We acknowledge that while the linear scoring rule might not be the most incentive-compatible one, it is arguably more practical to implement
4 It is possible that the results could be different if we implemented the pay-one-randomly payoff protocol. We therefore elicit subjects’ risk preferences using an incentivized risk attitude assessment (Holt & Laury, 2002), about which subjects are informed only after the completion of all three rounds. We use this measure to control for subjects’ risk preferences in a regression analysis.
in an experimental environment than more complex scoring rules (e.g. quadratic or logarithmic) due
to ease of explanation to subjects (Woods & Servátka, 2016).
The estimation accuracy earnings depend on the absolute difference between the actual task duration
and the estimate. In every round, the maximum earnings from a perfectly accurate estimate are AUD 4.50. The estimation accuracy earnings decrease by AUD 0.05 for every second of deviation from the actual task duration, as shown in Equation (1). However, we do not allow for negative estimation accuracy earnings. Thus, if the difference between the actual and the estimated time exceeds 90 seconds in either direction, the estimation accuracy earnings are zero for the given round.5 This particular design feature is implemented because we expected that a strong anchoring bias and the related estimation inaccuracy could cause many subjects to end up with negative (and possibly large negative) earnings. Our setting parallels a common practice in companies where planners are praised
or rewarded for their accurate estimates of successful projects but are usually not penalized for
inaccurate estimates when a project fails.
Estimation earnings = 4.50 − 0.05 ∗ |actual time in seconds − estimated time in seconds| (1)
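A minimal sketch of Equation (1), with the zero floor applied; the function name is ours:

```python
def estimation_earnings(actual_s: float, estimated_s: float) -> float:
    """Equation (1): AUD 4.50 minus AUD 0.05 per second of absolute error,
    floored at zero (negative accuracy earnings are not allowed)."""
    return max(0.0, 4.50 - 0.05 * abs(actual_s - estimated_s))

print(estimation_earnings(700, 700))  # 4.5 -> perfect estimate
print(estimation_earnings(700, 640))  # 1.5 -> 60 s off
print(estimation_earnings(700, 600))  # 0.0 -> beyond the 90 s threshold
```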
The earnings from task performance, presented in Equation (2), depend on the actual task duration
as well as on the number of correct and incorrect answers. The shorter the duration, the higher the
earnings. We penalize subjects for incorrect answers in order to discourage them from fast random
clicking. Such a design parallels the business practice where not only speed but also quality matters. We expected subjects to complete the task within 10–12.5 minutes and thus earn
between AUD 3.70 and 4.70 per round for their performance, making the task performance earnings
comparable with estimation accuracy earnings.
Performance earnings = 7 ∗ (number of correct answers − number of incorrect answers) / actual time in seconds (2)
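Equation (2) translates directly into code. The sanity check below reproduces the AUD 3.70–4.70 range quoted above for an error-free subject finishing within 10–12.5 minutes; the function name is ours:

```python
def performance_earnings(correct: int, incorrect: int, actual_s: float) -> float:
    """Equation (2): 7 * (correct - incorrect) / actual time in seconds."""
    return 7 * (correct - incorrect) / actual_s

# An error-free subject needs exactly 400 correct answers:
print(round(performance_earnings(400, 0, 600), 2))  # 4.67 (10 minutes)
print(round(performance_earnings(400, 0, 750), 2))  # 3.73 (12.5 minutes)
```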
The experiment consists of three treatments (Low Anchor, High Anchor, and Control) implemented in
an across-subjects design, meaning that each subject is randomly assigned to one and only one
treatment. In contrast to most of the extant studies on numerical anchoring, we include a Control
treatment that allows us to test for a general estimation bias and the possibility of “self-anchoring,”
5 The 90-second threshold was derived from the task durations observed in pilots (600–750 seconds). Project management estimation methodology requires definitive task estimates to fall within ±10% of the actual duration (Project Management Institute, 2013). We widened this range to 12–15% to make the estimation accuracy earnings more attractive.
i.e. whether the first estimate anchors future estimates of the same task. In addition, we use estimates
from the Control treatment to calibrate the low and high anchor values. The low anchor value is set at the 7th percentile and the high anchor value at the 93rd percentile of the Control treatment estimates, in line with the procedure for measuring the anchoring effect proposed by Jacowitz & Kahneman (1995). The implemented values are 3 and 20 minutes. The Low Anchor and High Anchor treatments are conducted according to the same experimental procedures as the Control treatment. However, before Round 1 (and only before Round 1), subjects answer an additional question containing the anchor, in the following form:
Will it take you less or more than [the anchor value] minutes to complete the task?
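The percentile calibration can be illustrated as follows. The Control estimates below are invented for illustration, so only the procedure, not the output, corresponds to the paper:

```python
import numpy as np

# Hypothetical unanchored (Control) duration estimates, in minutes:
control_estimates = np.array([4, 5, 6, 7, 8, 8, 9, 10, 10, 11, 12, 14, 18, 22])

# Jacowitz & Kahneman (1995): low anchor at the 7th percentile,
# high anchor at the 93rd percentile of the unanchored estimates.
low_anchor = np.percentile(control_estimates, 7)
high_anchor = np.percentile(control_estimates, 93)
print(low_anchor, high_anchor)  # the paper's implemented values were 3 and 20 minutes
```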
4. Hypotheses
We hypothesize that anchors influence the estimates in Round 1. Specifically, we expect estimates in the Low Anchor treatment to be significantly lower than those in the Control treatment, and estimates in the High Anchor treatment to be significantly higher than those in the Control treatment. Furthermore,
since our subjects do not receive feedback on their estimation accuracy, we expect the anchoring
effect to carry over to subsequent estimates in Round 2 and Round 3.
Hypothesis 1
o Estimate_L^1 < Estimate_C^1 < Estimate_H^1
o Estimate_L^2 < Estimate_C^2 < Estimate_H^2
o Estimate_L^3 < Estimate_C^3 < Estimate_H^3

where the superscript (1, 2, or 3) refers to Round 1, 2, or 3, respectively, and the subscript (L, C, or H) refers to the Low Anchor, Control, or High Anchor treatment.
Since the subjects are incentivized not only for their estimation accuracy but also for how quickly they
can finish the task, we expect them to work as fast as they can, independently of the treatment. In
other words, we hypothesize that anchors do not have any effect on the actual task duration.
Hypothesis 2
o Duration_L^1 = Duration_C^1 = Duration_H^1
o Duration_L^2 = Duration_C^2 = Duration_H^2
o Duration_L^3 = Duration_C^3 = Duration_H^3
By combining Hypotheses 1 and 2, we expect an underestimation of task duration in the Low Anchor
treatment but an overestimation in the High Anchor treatment. This is due to the presence of the
anchoring effect in estimation but not in task performance. Since subjects are not exposed to an
anchor in the Control treatment (and since our design provides favorable conditions for unbiased
estimates), we expect to find no systematic bias in task duration estimates in the Control treatment.
Hypothesis 3
o Estimate_L^t < Duration_L^t
o Estimate_C^t = Duration_C^t
o Estimate_H^t > Duration_H^t, where t = 1, 2, 3
5. Main results
A total of 93 subjects (45 females; mean age 20.7 years, standard deviation 4.5) participated in the experiment, which took place in the Vernon Smith Experimental Economics
Laboratory at the Macquarie Graduate School of Management in Sydney.6 Subjects were recruited via
the online subject-pool database system ORSEE (Greiner, 2015). The experiment was programmed in the z-Tree software (Fischbacher, 2007). After completing all three rounds, subjects answered a few questions about the task, completed the risk attitude assessment, and filled in a demographic questionnaire.
At the end of the experiment, subjects privately and individually received their experimental earnings
in cash. The average subject spent 45 minutes in the laboratory and earned AUD 16.50.
First, we present the results from data aggregated across all three experimental rounds. The
distribution of the actual task duration is skewed with asymmetric truncation, which is typical in the domain of task performance (see Figure 1a).7 The distribution of estimates (see Figure 1b) follows a similar pattern; however, the skewness is less pronounced, mostly because of the inflated estimates in the High Anchor treatment. The Shapiro-Wilk test of normality indicates
6 One subject was dropped from the sample because of her lack of comprehension. She repeatedly estimated the duration of the entire experimental session (i.e., the sum of all three rounds) instead of each round. The subject was debriefed while being paid, and we discovered her poor command of English. When asked after the experiment about the actual duration of the third round only, she made the same mistake again. Removing this data point does not change the treatment effect results.
7 The distribution of performance is usually skewed to the left because of the lower bound on the possible task duration. In
our case there exists a minimum time in which it is possible for a human to provide 400 correct answers.
that the distributions are not normal (p<0.001 for both pooled actual duration and estimates); we
therefore analyze the treatment effects using non-parametric tests.8
Figure 1a. The distribution of pooled task duration
Figure 1b. The distribution of pooled estimates
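The normality check is straightforward with SciPy. The sketch below assumes the pooled durations and estimates are available as arrays; the placeholder data are randomly generated, not the experimental data:

```python
import numpy as np
from scipy import stats

# Placeholder arrays; in the actual analysis these would hold the pooled
# per-subject, per-round durations and estimates (93 subjects x 3 rounds).
pooled_durations = np.random.default_rng(0).lognormal(6.5, 0.2, 279)
pooled_estimates = np.random.default_rng(1).lognormal(6.5, 0.4, 279)

for label, sample in (("duration", pooled_durations), ("estimate", pooled_estimates)):
    w, p = stats.shapiro(sample)  # Shapiro-Wilk test of normality
    print(f"{label}: W = {w:.3f}, p = {p:.4f}")  # the paper reports p < 0.001
```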
Next, we analyze subjects’ behavior in each round (see Table 1 for summary statistics). In line with our
Hypothesis 1, the estimates in the Low Anchor treatment are the lowest, whereas those in the High
Anchor treatment are the highest across all three experimental rounds. However, the absolute
differences diminish from one round to another (see Figure 2a). We analyze the changes using the
Wilcoxon matched-pairs signed-rank test and find that the estimates in the Low Anchor treatment rise
over time (statistically significantly from Round 1 to Round 2 but insignificantly from Round 2 to Round
3), while the estimates in the High Anchor treatment gradually fall over time, although the decreases between rounds are statistically insignificant. The estimates in the Control treatment are relatively stable,
consistently positioned between the estimates of the Low Anchor and High Anchor treatments and do
not change significantly from one round to another. Using the Mann-Whitney test (p-values are listed
in Table 2), we find that differences between the estimates in the Low Anchor and High Anchor
treatments are statistically significant across all three rounds, supporting our Hypothesis 1 that the
anchoring effect persists over time. Even though subjects display some degree of learning from the
task experience and move their estimates away from the anchor, the adjustment is insufficient and
the anchoring effect diminishes rather slowly.
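The two tests map onto the two types of comparison as sketched below, on made-up estimate vectors (in minutes); the data are purely illustrative:

```python
from scipy import stats

# Hypothetical Round 1 and Round 2 duration estimates, in minutes:
low_r1, low_r2 = [5, 6, 6, 7, 8, 9], [6, 8, 7, 9, 10, 12]
high_r1 = [14, 16, 18, 20, 22, 25]

# Between treatments (independent samples): Mann-Whitney test.
u_stat, p_between = stats.mannwhitneyu(low_r1, high_r1, alternative="two-sided")

# Within a treatment across rounds (paired observations):
# Wilcoxon matched-pairs signed-rank test.
w_stat, p_within = stats.wilcoxon(low_r1, low_r2)

print(f"Low vs High, Round 1: p = {p_between:.4f}")
print(f"Low Anchor, Round 1 vs Round 2: p = {p_within:.4f}")
```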
Result 1: The anchors influence the task duration estimates and the anchoring effect persists
over time.
Table 1. Descriptive statistics
8 Parametric tests yield qualitatively similar results, indicating the robustness of our findings. The details are available upon
request.
Rounds | Low Anchor (N = 31) | Control (N = 27) | High Anchor (N = 35)
Appendix: Instructions

Thank you for coming. Please put away your watches, mobile phones, and any other devices that show time. The experimenter will check the cubicles for the presence of time-showing devices before the start of the experiment.
Also, please note that from now until the end of the experiment, no talking or any other unauthorized
communication is allowed. If you violate any of the above rules, you will be excluded from the experiment and
from all payments. If you have any questions after you finish reading the instructions, please raise your hand.
The experimenter will approach you and answer your questions in private.
Please read the following instructions carefully. The instructions will provide you with information on how to
earn money in this experiment.
The experimenters will keep track of your decisions and earnings by your cubicle number. The information about
your decisions and earnings will not be revealed to other participants.
Three rounds of the same two tasks
The experiment consists of three rounds. In each round, you will perform two tasks – the comparison task and
the estimation task.
The comparison task
The screen will show an inequality between two numbers ranging from 10 to 99. You will evaluate whether the
presented inequality is true or false. Immediately after you submit your answer, a new inequality will show up.
This task finishes after you have provided 400 correct answers.
Examples: [sample inequality screenshots]
The estimation task
At the beginning of each round, you will be asked to estimate how long it will take you to complete the
comparison task, that is, how long it will take you to provide 400 correct answers.
The earnings structure
Your total earnings (in AUD) from the experiment will be the sum of your comparison task earnings and
estimation task earnings for all three rounds.
The comparison task earnings (CTE)
In each round, your comparison task earnings (in AUD) will be calculated as follows:
Comparison task earnings = 7 ∗ (number of correct answers − number of incorrect answers) / actual time in seconds

Your comparison task earnings will depend on the actual time in which you complete the task and on the number
of correct and incorrect answers you provide. Notice that the faster you complete the task (i.e., provide 400
correct answers), the more money you earn. However, note also that your earnings will be reduced for every
incorrect answer that you provide.
The estimation task earnings (ETE)
In each round, your estimation task earnings (in AUD) will be calculated as follows:
Estimation task earnings = 4.5 − 0.05 ∗ |actual time in seconds − estimated time in seconds| *

* If the difference between your actual and estimated time is more than 90 seconds (in either direction), your estimation task earnings will be set to 0 for the given round.
Notice that the estimation task earnings will depend on the accuracy of your estimate. The calculation is based
on the absolute difference between the actual time in which you complete the comparison task and your