Comparing Probabilistic Forecasting Systems with the Brier Score
CHRISTOPHER A. T. FERRO
School of Engineering, Computing and Mathematics, University of
Exeter, Exeter, United Kingdom
(Manuscript received 13 April 2006, in final form 16 January
2007)
ABSTRACT
This article considers the Brier score for verifying ensemble-based probabilistic forecasts of binary events. New estimators for the effect of ensemble size on the expected Brier score, and associated confidence intervals, are proposed. An example with precipitation forecasts illustrates how these estimates support comparisons of the performances of competing forecasting systems with possibly different ensemble sizes.
1. Introduction
Verification scores are commonly used as numerical summaries for the quality of weather forecasts. General introductions to forecast verification are given by Jolliffe and Stephenson (2003) and Wilks (2006, chapter 7). There are many situations in which we may wish to compare the values of a verification score computed for two sets of forecasts: to compare the quality of forecasts from a single forecasting system at different times or locations, or in different meteorological conditions, or to compare the quality of forecasts from two forecasting systems. Several authors have recommended that measures of uncertainty for the scores, such as standard errors or confidence intervals, should be computed to aid such comparisons. Woodcock (1976), Seaman et al. (1996), Kane and Brown (2000), Stephenson (2000), Thornes and Stephenson (2001), and Wilks (2006, section 7.9) propose confidence intervals for scores of deterministic binary-event forecasts. Bradley et al. (2003) use simulation to compare the biases and standard errors of different estimators for several scores of probabilistic binary-event forecasts, but do not discuss estimators for the standard errors. Hamill (1999) takes a different approach and proposes hypothesis tests for comparing the scores of two sets of deterministic or probabilistic forecasts; see also Briggs and Ruppert (2005). Jolliffe (2007) reviews this work and related ideas, and also presents confidence intervals for the correlation coefficient.
Woodcock (1976) explains the motivation for these attempts to quantify uncertainty. The value of a score depends on the choice of target observations, so the superiority of one forecasting system over another as gauged by their verification scores computed for only finite samples of forecasts and observations cannot be definitive: the superiority may be reduced or even reversed were the systems applied to new target observations. The methods listed previously estimate the variation that would arise in the value of a verification score were forecasts made for different sets of potential target observations, and thereby quantify the uncertainty about some “true” value that would be known were forecasts available for all potential observations. We shall consider uncertainty in expected values of the Brier verification score (Brier 1950) in the case of ensemble-based probabilistic binary-event forecasts. Unbiased estimators for the expected Brier score that would be obtained for any ensemble size are defined in section 2. Standard errors and confidence intervals are developed in section 3, and their performance is assessed with a simulation study in section 4. Methods for comparing the Brier scores of two forecasting systems are presented in section 5. Confidence intervals appropriate for comparing Brier scores of two systems simultaneously for multiple events and sites are described in section 6.
The methods are illustrated throughout the paper with seasonal precipitation forecasts from the Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER) project
(Palmer et al. 2004). In particular, 3-month-ahead, nine-member ensemble forecasts of total October precipitation produced by the European Centre for Medium-Range Weather Forecasts (ECMWF) and Météo-France (MF) global atmosphere–ocean coupled models are compared with observations recorded at Jakarta (6.17°S, 106.82°E) for the years 1958–95. Data are missing for 1983, leaving 37 yr. The ensembles were generated by sampling independent perturbations of the initial ocean state. The Jakarta observations and the forecasts from the corresponding grid box are shown in Fig. 1.
2. Expected Brier scores
a. Definitions
We define the Brier score together with notation that will be used throughout the rest of the paper. Let {X_t : t = 1, ..., n} be a set of n observations, and let {(Y_{t,1}, ..., Y_{t,m}) : t = 1, ..., n} be a corresponding set of m-member ensemble forecasts. For each time t, suppose that we issue a probabilistic forecast, Q̂_t, for the event that observation X_t exceeds a threshold u. The Brier score for this set of forecasts, equal to one-half of the score originally proposed by Brier (1950), is the mean squared difference between the forecasts and the indicator variables for the event; that is,

\hat{B}_{m,n} = \frac{1}{n} \sum_{t=1}^{n} (\hat{Q}_t - I_t)^2,

where I_t = I(X_t > u), I(A) = 1 if A is true, and I(A) = 0 if A is false. All summations will be over t = 1, ..., n unless otherwise specified.
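To make the definition concrete, the following minimal sketch (in Python with NumPy rather than the author's R code; the function and variable names are illustrative) evaluates B̂_{m,n} when the forecasts are the proportions of ensemble members exceeding a threshold, as in the forecast definition adopted in section 2b.

```python
import numpy as np

def brier_score(x, y, u, v=None):
    """Brier score for ensemble forecasts of the event X_t > u.

    x : (n,) array of observations X_t
    y : (n, m) array of ensemble members Y_{t,i}
    u : observation threshold defining the event indicators I_t
    v : ensemble threshold defining the forecasts (defaults to u)
    """
    v = u if v is None else v
    q_hat = (y > v).mean(axis=1)         # forecast probabilities Q_hat_t
    i_t = (x > u).astype(float)          # event indicators I_t
    return np.mean((q_hat - i_t) ** 2)   # mean squared difference

# Synthetic example (not the DEMETER data): 37 yr, nine members.
rng = np.random.default_rng(1)
x = rng.normal(size=37)
y = 0.4 * x[:, None] + np.sqrt(1 - 0.4 ** 2) * rng.normal(size=(37, 9))
print(brier_score(x, y, u=float(np.quantile(x, 0.5))))
```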
We mentioned in the previous section that the variation in verification scores caused by the choice of target observations leads to uncertainty about the quality of the forecasting system. One quantity of interest that we may be uncertain about is the expected Brier score, denoted

B_m = E(\hat{B}_{m,n}),     (1)

and defined as the average Brier score over repeated samples from a population of observations and forecasts. This population can be defined implicitly by assuming that the available sample of observations and forecasts is in some sense representative of the larger population. We assume that the population is a stationary sequence of which our data form a partial realization. This is likely to be a good approximation in a stable climate and could be generalized for a changing climate by assuming, for example, that the data are a partial realization of a nonstationary, multivariate time series model chosen to represent climatic change. We shall concentrate on B_m, but other quantities, such as the conditional expected Brier score discussed in appendix A, may also be of interest.
b. The effects of ensemble size
We investigate how Bm depends on the ensemble sizem. By
stationarity, the expectation of (Q̂t � It)
2 is thesame for all t, so we can write
Bm � E ��Q̂ � I�2�,
where Q̂ and I are the forecast and the event indicatorfor an
arbitrary time. This expectation is an averageover all possible
values of the observation variable Xand the ensemble Y � (Y1, . . .
, Ym). Now let Z denotesufficient information about the forecasting
model todetermine a probability distribution for Y, given whichY is
independent of X. This information might be themodel specification
plus a probability distribution forits initial conditions, for
example. The law of total ex-pectation (e.g., Severini 2005, p. 55)
says that we canobtain Bm in two stages: first by taking the
expectationwith respect to X and Y when Z is held fixed, and
thenaveraging over Z. This is written
Bm � EE ��Q̂ � I�2 |Z�.
[Fig. 1. Observed (vertical lines) October rainfall (mm) in Jakarta from 1958 to 1995, plotted between both the ECMWF (filled circles) and MF (open circles) nine-member ensemble forecasts.]
The interior, conditional expectation is

E[(\hat{Q} - I)^2 \mid Z] = E(\hat{Q}^2 \mid Z) - 2E(\hat{Q}I \mid Z) + E(I \mid Z)
                         = E(\hat{Q}^2 \mid Z) - 2E(\hat{Q} \mid Z)P + P,     (2)

where P = E(I | Z) = Pr(X > u | Z) is the probability with which the event occurs.

We must specify how Q̂ is formed from the ensemble members in order to reveal the effects of ensemble size. We choose forecasts equal to the proportion of members that exceed a threshold v, possibly different from u; that is,

\hat{Q} = \frac{K}{m} = \frac{1}{m} \sum_{i=1}^{m} I(Y_i > v),     (3)

where K is the number of members exceeding v. Alternative forecasts could be considered, for example, (K + a)/(m + b) with b ≥ a ≥ 0, although these lead to more complicated formulas later.
For simplicity, we assume that the members within any particular ensemble are exchangeable. Exchangeability means that the members are indistinguishable by their statistical properties: their joint distribution function is invariant to relabeling the members. This admits homogeneous dependence between members and includes the special case of independent and identically distributed members.

Exchangeability implies that all members of an ensemble exceed v with the same probability,

\Pr(Y_i > v \mid Z) = Q \quad \text{for all } i,

and all pairs of members jointly exceed v with the same probability,

\Pr(Y_i > v, Y_j > v \mid Z) = R \quad \text{for all } i \neq j.

Taken together with the forecast definition [(3)], we have
E(\hat{Q} \mid Z) = \frac{1}{m} \sum_{i=1}^{m} \Pr(Y_i > v \mid Z) = Q,     (4)

E(\hat{Q}^2 \mid Z) = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \Pr(Y_i > v, Y_j > v \mid Z) = \frac{Q}{m} + \frac{m-1}{m} R,     (5)

and the conditional expectation [(2)] equals

R - 2PQ + P + \frac{1}{m}(Q - R).

Finally, we take the expectation with respect to Z to obtain

B_m = E(R) - 2E(PQ) + E(P) + \frac{1}{m} E(Q - R).     (6)

Because P, Q, and R are independent of m, this expression describes completely the effects of ensemble size. Moreover, the final term is non-negative because R ≤ Q. As the ensemble size increases, B_m therefore decreases monotonically to the expected Brier score, B_∞, that would be obtained for an infinite ensemble size, and we can write

B_M = B_\infty + \frac{1}{M} E(Q - R),     (7)

where B_M is the expected Brier score that would be obtained for an ensemble of size M. This generalizes the relationship found by Richardson [2001, Eq. (9)] for independent ensemble members, in which case R = Q².
c. Unbiased estimators
The Brier score B̂_{m,n} is an unbiased estimator for B_m by definition [(1)] but is biased for B_M when M ≠ m. Estimating B_M from ensembles of size m is useful for comparing forecasting systems with different ensemble sizes or for assessing the potential benefit of larger ensembles (cf. Buizza and Palmer 1998). Equations (4) and (5) can be used to show that an unbiased estimator for B_M is

\hat{B}_{M,n} = \hat{B}_{m,n} - \frac{M - m}{M(m-1)n} \sum_t \hat{Q}_t (1 - \hat{Q}_t),     (8)

and letting M → ∞ yields an unbiased estimator for B_∞.
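The estimator (8) is simple to compute; the sketch below (an illustrative Python implementation, not the author's code, with hypothetical names) returns B̂_{M,n} for any target ensemble size M, including M = ∞ for B̂_{∞,n}.

```python
import numpy as np

def adjusted_brier_score(x, y, u, M, v=None):
    """Unbiased estimator of the expected Brier score for ensemble size M,
    Eq. (8), from ensembles of size m >= 2; use M = np.inf for B_infinity."""
    v = u if v is None else v
    n, m = y.shape
    q_hat = (y > v).mean(axis=1)
    i_t = (x > u).astype(float)
    b_mn = np.mean((q_hat - i_t) ** 2)        # Brier score for ensemble size m
    factor = (1.0 - m / M) / (m - 1.0)        # equals (M - m) / [M (m - 1)]
    return b_mn - factor * np.mean(q_hat * (1.0 - q_hat))
```

With M = np.inf the factor reduces to 1/(m − 1), giving the unbiased estimator of B_∞ mentioned above.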
1) REMARK 1
The new estimator [(8)] is undefined if m = 1, in which case an unbiased estimator for B_M (M > 1) does not exist because the forecasts contain no information about the effects of ensemble size. Mathematically, for any function h(K, I) independent of R,

E[h(K, I) \mid Z] = h(0, I)(1 - Q) + h(1, I)Q

cannot contain the required R term. Richardson (2001) does, however, develop a method for estimating a skill score based on B_M given an ensemble of any size, even m = 1. He achieves this by assuming independent ensemble members (R = Q²) and perfect reliability (Q = P), in which case the expression [(6)] for B_m becomes

B_m = \frac{m+1}{m} E[Q(1 - Q)],
and so
B_M = \frac{m(M+1)}{M(m+1)} B_m.     (9)

In this special case, an unbiased estimator can therefore be obtained for B_M even when m = 1 by substituting B̂_{m,n} for B_m in the right-hand side of (9).
2) REMARK 2
The adjustment term in the definition [(8)] of B̂_{M,n} depends on only the forecasts and is a measure of sharpness (e.g., Potts 2003). Let

S = \frac{1}{n} \sum_t \left(\hat{Q}_t - \tfrac{1}{2}\right)^2

be the sample variance of the forecasts around one-half: as S decreases from its maximum value (1/4) to its minimum value (0), forecasts become more concentrated around one-half and the sharpness decreases. Now,

\hat{B}_{M,n} = \hat{B}_{m,n} - \frac{M - m}{M(m-1)} \left(\tfrac{1}{4} - S\right),

so B̂_{M,n} reduces the Brier score by amounts that depend on the estimated sharpness and the ensemble size, m. For fixed sharpness, the improvement in forecast quality from increasing the ensemble size decreases as m increases: the law of diminishing returns. For fixed m, the improvement decreases as the sharpness increases, suggesting that the improvement may be attributed to the opportunity to shift forecasts slightly farther away from one-half.
3) REMARK 3
The Brier score B̂_{m,n} is proper (e.g., Wilks 2006, p. 298) because, if our belief in the occurrence of the event {I_t = 1} equals p ∈ [0, 1], then the expected contribution to the Brier score with respect to this belief from issuing forecast Q̂_t, that is,

E[(\hat{Q}_t - I_t)^2] = \hat{Q}_t^2 (1 - p) + (\hat{Q}_t - 1)^2 p,

is minimized by choosing Q̂_t = p. Similar calculations show that B̂_{M,n} is improper when M ≠ m because the optimum forecast is then

p + \frac{M - m}{m(M-1)} \left(\tfrac{1}{2} - p\right).

Therefore, B̂_{M,n} should not be used in situations where it could be hedged.
d. Exchangeability
We assumed that ensemble members were exchangeable and independent of observations given suitable information, Z. The latter assumption is hard to contest because Z can include the full specification of the forecasting model and its inputs. Exchangeability is more restrictive and would be violated were one member biased relative to the other members, for example, or were one pair of members more strongly correlated than other pairs.

Exchangeability might be justified by the process generating the ensemble. For example, exchangeability will hold if the initial conditions for the members are randomly sampled from a probability distribution. Exchangeability is also likely to hold if the forecast lead time is long enough for any initial ordering or dependence between the members to be lost. This latter argument seems appropriate for our 3-month-ahead rainfall forecasts.
Exchangeability might also be justified by empirical assessment. Romano (1988) describes a bootstrap test for exchangeability based on the maximum distance between the empirical distribution functions of the members and permuted members. Applying this test for the ECMWF and MF ensemble forecasts of Jakarta rainfall gave p values of 0.26 and 0.24, which is only weak evidence for rejecting exchangeability.

The effect of ensemble size on the expected Brier score can still be estimated even when exchangeability is unjustifiable. If we wish to estimate B_M for a subset of M < m members, then an unbiased estimator is simply the Brier score evaluated for the forecasts constructed from those M members. This approach is straightforward to implement for any verification score, but is inapplicable when M > m.
e. Data example
We estimate the expected Brier scores B_m and B_∞ for the ECMWF and MF rainfall forecasts at a range of event thresholds u and v. These are shown in Fig. 2, where we set v = u and let u range from the 10% to the 90% quantiles of the observed rainfall. The MF forecasts appear to have significantly lower Brier scores than do those of the ECMWF for thresholds below 90 mm (about the median observed rainfall), and the two systems have similar Brier scores at higher thresholds. The estimated difference between B_m for MF and B_∞ for ECMWF is also large below 90 mm, suggesting that increasing the ECMWF ensemble size would not be sufficient to match the MF Brier score.
3. Sampling variation
a. Standard errors
Point estimates of expected Brier scores were presented in the previous section. In this section, we estimate the uncertainty associated with these estimates due to sampling variation. In particular, we shall estimate standard errors and construct confidence intervals for the expected scores. We assume only that the data are stationary; we no longer need to assume exchangeability or a particular form [(3)] for the forecasts. We shall consider only B̂_{M,n}, the estimator [(8)] for B_M, because other estimators can be obtained as special cases by changing M.

We can write B̂_{M,n} as the sample mean of the summands

W_t = (\hat{Q}_t - I_t)^2 - \frac{M - m}{M(m-1)} \hat{Q}_t (1 - \hat{Q}_t).
If the interval between successive times t is large, then we may be justified in making the further assumption that these summands are independent, in which case the standard error of B̂_{M,n} is estimated by

\hat{\sigma}_{M,n} = \sqrt{\frac{\sum_t (W_t - \hat{B}_{M,n})^2}{n(n-1)}}.

If the summands are dependent, then estimates of serial correlation may be incorporated into the standard error (e.g., Wilks 1997).
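Assuming serially independent summands, the estimate and its standard error can be computed together; the sketch below (Python, illustrative names) also returns the summands W_t so that they can be fed to the bootstrap of section 3b.

```python
import numpy as np

def summands_and_se(x, y, u, M, v=None):
    """Summands W_t of B_hat_{M,n}, the point estimate, and the standard
    error sigma_hat_{M,n} under serial independence."""
    v = u if v is None else v
    n, m = y.shape
    q_hat = (y > v).mean(axis=1)
    i_t = (x > u).astype(float)
    w = (q_hat - i_t) ** 2 - (1.0 - m / M) / (m - 1.0) * q_hat * (1.0 - q_hat)
    b_hat = w.mean()
    se = np.sqrt(np.sum((w - b_hat) ** 2) / (n * (n - 1)))
    return w, b_hat, se
```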
There is little evidence for serial dependence in the summands of the Brier scores for our ECMWF and MF rainfall forecasts. For example, a two-sided test for the lag-one autocorrelation (e.g., Chatfield 2004, p. 56) was conducted for both the ECMWF and MF data at each of nine thresholds v = u ranging from the 10% to the 90% quantiles of the observed rainfall, and only one p value was smaller than 0.1. We assume serial independence hereafter. Standard errors for the estimates of B_m and B_∞ are shown in Fig. 2 and are large enough to call into question the statistical significance of the differences noted previously in the quality of the ECMWF and MF forecasts. These differences are assessed more formally in section 5.
b. Confidence intervals
More informative descriptions of uncertainty are afforded by confidence intervals, which we now construct. Unless the summands of B̂_{M,n} exhibit long-range dependence, we can expect a central limit theorem to hold and imply that B̂_{M,n} is approximately normally distributed when n is large. An approximate (1 − 2α) confidence interval for B_M would then be

\hat{B}_{M,n} \pm \hat{\sigma}_{M,n} z_\alpha,

where z_α is the α quantile of the standard normal distribution.
An alternative approximation to the distribution of B̂_{M,n} is available via the bootstrap method (e.g., Davison and Hinkley 1997). To obtain confidence intervals for B_M, the distribution of the studentized statistic

T_n = \frac{\hat{B}_{M,n} - B_M}{\hat{\sigma}_{M,n}}

is approximated by a bootstrap sample {T*_{n,i} : i = 1, ..., r}. If summands are independent, then this sample can be formed by repeating the following steps for each i = 1, ..., r:

1) Resample W*_s uniformly and with replacement from {W_t : t = 1, ..., n} for each s = 1, ..., n.
2) Set \hat{B}^*_{M,n} = \sum_t W^*_t / n and

\hat{\sigma}^*_{M,n} = \sqrt{\frac{\sum_t (W^*_t - \hat{B}^*_{M,n})^2}{n(n-1)}}.

3) Set T^*_{n,i} = (\hat{B}^*_{M,n} - \hat{B}_{M,n}) / \hat{\sigma}^*_{M,n}.

Block bootstrapping (e.g., Wilks 1997) can be employed if the summands are serially dependent.
[Fig. 2. Estimates B̂_{m,n} (upper thick) and B̂_{∞,n} (lower thick) for the ECMWF (solid) and MF (dashed) forecasts of October rainfall at Jakarta exceeding different thresholds during the period 1958–95. Thresholds are marked as quantiles (lower axis) and absolute values (mm, upper axis) of the observed rainfall. Standard errors (upper thin) and their conditional versions (lower thin; see appendix A) are shown for the ECMWF (solid) and MF (dashed) forecasts, and are indistinguishable for B̂_{m,n} and B̂_{∞,n}.]
Bootstrap (1 − 2α) confidence intervals are then defined by the limits

\hat{B}_{M,n} - \hat{\sigma}_{M,n} T^{*(r+1-k)}_{n} \quad \text{and} \quad \hat{B}_{M,n} - \hat{\sigma}_{M,n} T^{*(k)}_{n},

where k = ⌊αr⌋ and T^{*(1)}_n ≤ ... ≤ T^{*(r)}_n are the order statistics of the bootstrap sample. Neither the normal nor the bootstrap confidence limits are guaranteed to fall in the interval [0, 1], so they will always hereafter be truncated at the end points.
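A direct translation of steps 1–3 and the limits above, under serial independence, might look like the following sketch (Python; the value of r, the names, and the lack of a guard against degenerate resamples with zero spread are all illustrative choices).

```python
import numpy as np

def studentized_bootstrap_ci(w, alpha=0.05, r=5000, rng=None):
    """Equitailed (1 - 2*alpha) studentized bootstrap interval for B_M
    from the summands W_t, truncated to [0, 1]; assumes alpha * r >= 1."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(w)
    b_hat = w.mean()
    se = np.sqrt(np.sum((w - b_hat) ** 2) / (n * (n - 1)))
    t_star = np.empty(r)
    for i in range(r):
        w_star = rng.choice(w, size=n, replace=True)                 # step 1
        b_star = w_star.mean()                                       # step 2
        se_star = np.sqrt(np.sum((w_star - b_star) ** 2) / (n * (n - 1)))
        t_star[i] = (b_star - b_hat) / se_star                       # step 3
    t_star.sort()
    k = int(np.floor(alpha * r))
    lower = b_hat - se * t_star[r - k]       # uses order statistic T*_(r+1-k)
    upper = b_hat - se * t_star[k - 1]       # uses order statistic T*_(k)
    return max(lower, 0.0), min(upper, 1.0)
```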
These confidence intervals can be used to test hypotheses of the form B_M = b, for some reference value b that represents minimal forecast quality. If a two-sided (1 − α) confidence interval for B_M does not contain b, then the hypothesis is rejected in favor of the two-sided alternative hypothesis B_M ≠ b at the 100α% level. One common reference value for B_m is the Brier score, q² + (1 − 2q)∑_t I_t / n, obtained if the same probability q is forecast at every time t. Another is the expected Brier score, (2m + 1)/(6m) or 1/3, obtained if the forecast at each time t is selected independently from a uniform distribution on either the set {i/m : i = 0, ..., m} or the interval [0, 1].
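For reference, the two benchmark values just described can be computed directly; this small sketch (illustrative Python) returns the constant-forecast score and the random-forecast expectation for an m-member ensemble.

```python
import numpy as np

def reference_scores(i_t, m, q=None):
    """Brier score of a constant forecast q (climatology when q = mean(I_t))
    and expected score of a forecast drawn uniformly from {0, 1/m, ..., 1}."""
    q = i_t.mean() if q is None else q
    constant = q ** 2 + (1 - 2 * q) * i_t.mean()
    random_m = (2 * m + 1) / (6 * m)          # tends to 1/3 as m -> infinity
    return constant, random_m
```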
The dark gray bands in the top two panels of Fig. 3 are bootstrapped 90% confidence intervals (using r = 5000) for B_m for the ECMWF and MF rainfall forecasts. The ECMWF forecasts are significantly worse than climatology (the constant forecast q = ∑_t I_t / n) at the 10% level for a few thresholds, but are significantly better than random forecasts except between the 30% and 70% quantiles (50–130 mm). The MF forecasts are not significantly different from climatology at any threshold, but are significantly better than random forecasts except between the 45% and 65% quantiles (70–110 mm).
4. Simulation study
a. Serial independence
We compare the performances of the proposed normal and bootstrap confidence intervals for B_m with a simulation study. The performance of an equitailed (1 − 2α) confidence interval is commonly assessed by its achieved coverage and average length in repeated simulated datasets for which the true value of B_m is known. Let B̂_i be the point estimate and let L_i and U_i be the lower and upper confidence limits computed from the ith of N datasets. The achieved lower and upper coverages are the proportions of times that B_m falls above and below the lower and upper limits; that is,

N^{-1} \sum_{i=1}^{N} I(B_m \geq L_i) \quad \text{and} \quad N^{-1} \sum_{i=1}^{N} I(B_m \leq U_i),

which should both equal 1 − α. The average length is the mean distance between the lower and upper limits; that is,

N^{-1} \sum_{i=1}^{N} (U_i - L_i),

which should be as small as possible.
which should be as small as possible.The performance of the
confidence intervals depends
on the ensemble size m, the sample size n, the thresh-olds u and
�, the target coverage defined by �, and thejoint distribution of
the observations and forecasts. Weexamine the effects of all of
these factors in this simu-lation study, although a complete
investigation is im-possible. Serially independent observations are
simu-lated from a standard normal distribution. Ensemblemembers are
also normally distributed, and each has acorrelation � with its
contemporary observation but isotherwise independent of the other
members. Forecastsare simple proportions [(3)] and we use
thresholds u �� equal to p quantiles of the standard normal
distribu-tion. We consider the following values for the
variousfactors: m � 2, 4, 8; n � 10, 20, 40; p � 0.5, 0.7, 0.9;� �
0, 0.4, 0.8; and � between 0.005 and 0.05. Results forp � 0.1 and
0.3 would be the same as for p � 0.9 and0.7, respectively, because
the former could be obtainedby redefining events as deficits below
thresholds, which
[Fig. 3. (top) Brier scores B̂_{m,n} (solid) for the ECMWF forecasts, with bootstrapped 90% confidence intervals for B_m (dark gray) and B_{m,n} (light gray; see appendix A) at each threshold. Expected Brier scores are also shown for random forecasts (dotted) and climatology (dashed). (middle) The same as in the top panel but for the MF forecasts. (bottom) The difference (solid) in B̂_{m,n} between the ECMWF and MF forecasts, with bootstrapped 90% confidence intervals for the differences between B_m (dark gray) and B_{m,n} (light gray).]
We use N = 10 000 datasets and r = 1000 bootstrap samples throughout.

We show results for m = 8 and n = 40 only; results are qualitatively similar for different values. Figure 4 shows the lower and upper coverage errors for the normal and bootstrap confidence intervals as α varies and with different values for ρ and p. Figure 5 shows the corresponding lengths. The coverage errors of the lower limits are usually positive (the lower limits are too low and overcover) while the upper errors are often negative (the upper limits are too low and undercover). The errors are always smaller for the bootstrap limits than for the normal limits. The bootstrap achieves this by shrinking the lower limits and extending the upper limits compared to the normal limits (not shown) to capture asymmetry in the sampling distribution of the Brier score, producing wider intervals for the bootstrap, as revealed by Fig. 5. The interval lengths decrease as both α and ρ increase.
b. Serial dependence
To investigate the sensitivity of the results to the presence of serial dependence, the simulations were repeated with observations generated from a first-order moving-average process with correlation 0.5 at lag one. The standard errors were adjusted for the lag-one correlation and the block bootstrap was employed with blocks of size two. Results (not shown) were qualitatively similar to those for serial independence, except that both positive and negative lower coverage errors were found. Errors were larger and intervals wider because of the smaller effective sample sizes.
c. Modified bootstrap intervals
The bootstrap coverage errors in Fig. 4 are typically less than α/2. Errors decrease as n increases, so these intervals may be acceptable for many applications. Improvements are desirable, however, particularly for small sample sizes and rare events. Several modifications have been explored by the author, specifically basic and percentile bootstrap intervals and bootstrap calibration (Davison and Hinkley 1997, chapter 5) and a continuity correction (Hall 1987) to account for the discrete nature of the summands of the Brier score. None of these methods improved significantly on the studentized intervals presented above.
[Fig. 4. Monte Carlo estimates of normal (solid) and bootstrap (dashed) lower (left) and upper (right) coverage errors plotted against α when m = 8; n = 40; ρ = 0 (thin), 0.4 (medium), and 0.8 (thick); and p = (top) 0.9, (middle) 0.7, and (bottom) 0.5. Solid horizontal lines mark zero error.]
A variance-stabilizing transformation proposed by DiCiccio et al. (2006) was also applied and found to reduce the coverage error in the lower limit, especially for rare events for which errors were approximately halved. The effect on the upper limits was small. A large part of the coverage error in small samples arises from the fact that the Brier score can take only a small number of distinct values. One way to reduce these errors is to smooth the Brier score by adding a small amount of random noise (Lahiri 1993). Investigations unreported here show that this can indeed reduce coverage errors significantly at the expense of widening the confidence intervals. However, results depend strongly on the amount of smoothing employed and making general recommendations is difficult. An alternative solution could be to fit joint probability distributions to the observations and forecasts before determining the forecast probabilities (Bradley et al. 2003). This would allow the forecasts, and hence the Brier score, to take any values on the interval [0, 1], and so avoid discretization errors. Another advantage would be the avoidance of intervals with zero length, which occurs for both the normal and bootstrap intervals described above when all summands of the Brier score are equal.
5. Comparing Brier scores
Consider the task of comparing the Brier scores of two forecasting systems, the first with ensemble size m and the second with ensemble size m′, verified against the same set of observations. Quantities pertaining to the second system will be distinguished with primes. We can compare the two systems by constructing a confidence interval for the difference, B_M − B′_M, between the Brier scores that would be expected were both ensemble sizes equal to M. Such a comparison may help to identify whether or not a perceived superiority of one system is due only to its larger ensemble size. If the (1 − α) confidence interval does not contain zero, then the null hypothesis of equal scores is rejected in favor of the two-sided alternative at the 100α% level. We estimate the difference between the Brier scores using unbiased estimators [(8)], though the subsampling approach described at the end of section 2d could also be used.
Normal confidence intervals are defined by

(\hat{B}_{M,n} - \hat{B}'_{M,n}) \pm \tau_{M,n} z_\alpha,

where, if there is no serial dependence,

\tau^2_{M,n} = \frac{1}{n(n-1)} \sum_t \left[ (W_t - W'_t) - (\hat{B}_{M,n} - \hat{B}'_{M,n}) \right]^2.

As in section 3, this can be adjusted to account for serial dependence. Bootstrap intervals approximate the distribution of

D_n = \frac{(\hat{B}_{M,n} - \hat{B}'_{M,n}) - (B_M - B'_M)}{\tau_{M,n}}

with a bootstrap sample {D*_{n,i} : i = 1, ..., r}. If the summands are serially independent, then this sample can be formed by repeating the following steps for each i = 1, ..., r.

1) Resample pairs (W*_s, W′*_s) uniformly and with replacement from {(W_t, W′_t) : t = 1, ..., n} for each s = 1, ..., n.
2) Compute \hat{B}^*_{M,n}, \hat{B}'^*_{M,n}, and \tau^*_{M,n} for the resampled data.
3) Set D^*_{n,i} = [(\hat{B}^*_{M,n} - \hat{B}'^*_{M,n}) - (\hat{B}_{M,n} - \hat{B}'_{M,n})] / \tau^*_{M,n}.
The first step preserves dependence between contemporary summands of the two scores. Block bootstrapping may again be employed if the summands are serially dependent, and confidence limits take the form

(\hat{B}_{M,n} - \hat{B}'_{M,n}) - \tau_{M,n} D^{*(r+1-k)}_{n} \quad \text{and} \quad (\hat{B}_{M,n} - \hat{B}'_{M,n}) - \tau_{M,n} D^{*(k)}_{n}.
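The paired resampling of steps 1–3 can be sketched as follows (Python, illustrative names); it takes the summands W_t and W′_t of the two systems, already computed for a common M and stored as NumPy arrays, and returns the equitailed limits.

```python
import numpy as np

def difference_bootstrap_ci(w1, w2, alpha=0.05, r=5000, rng=None):
    """Equitailed (1 - 2*alpha) bootstrap interval for B_M - B'_M from the
    paired summands of the two systems (serial independence assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(w1)

    def diff_and_scale(a, b):
        d = a - b
        diff = d.mean()
        scale = np.sqrt(np.sum((d - diff) ** 2) / (n * (n - 1)))
        return diff, scale

    diff_hat, tau_hat = diff_and_scale(w1, w2)
    d_star = np.empty(r)
    for i in range(r):
        idx = rng.integers(0, n, size=n)          # step 1: resample pairs jointly
        diff_s, tau_s = diff_and_scale(w1[idx], w2[idx])
        d_star[i] = (diff_s - diff_hat) / tau_s   # steps 2-3
    d_star.sort()
    k = int(np.floor(alpha * r))                  # assumes alpha * r >= 1
    return diff_hat - tau_hat * d_star[r - k], diff_hat - tau_hat * d_star[k - 1]
```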
Bootstrapped 90% confidence intervals for the difference between B_m for the ECMWF and MF forecasts are illustrated by the dark gray bands in Fig. 3 (bottom panel). The scores are significantly different at the 10% level between only the 0.3- and 0.4-quantile thresholds (50–70 mm).
[Fig. 5. Monte Carlo estimates of normal (solid) and bootstrap (dashed) interval lengths plotted against α when m = 8; n = 40; ρ = 0 (thin), 0.4 (medium), and 0.8 (thick); and p = (top) 0.9, (middle) 0.7, and (bottom) 0.5.]
The statistical significance of the differences between Brier scores can also be quantified using hypothesis tests. The powers of four such tests are investigated in appendix B, where the permutation test is found to be an attractive alternative to the bootstrap test presented above. The permutation test yields similar results for our data, however, with p values less than 0.1 between only the 0.3- and 0.4-quantile thresholds.
6. Multiple comparisons
We have so far constructed confidence intervals separately for each threshold u. These intervals are designed to contain the quantity of interest, such as an expected score or the difference between two expected scores, with a certain probability at each individual threshold. We may wish, however, to construct confidence intervals such that the quantity of interest is contained simultaneously within the intervals at all thresholds with a certain probability. We describe how to construct such confidence intervals in this section.
Denote by B(u) the quantity of interest at threshold u and suppose that we want to consider a collection S of thresholds. We aim to find confidence limits L(u) and U(u) for each u ∈ S such that

\Pr\{L(u) \leq B(u) \leq U(u) \text{ for all } u \in S\} = 1 - 2\alpha.     (10)

If we used the (1 − 2α) confidence limits proposed in previous sections for L(u) and U(u), then this probability would be less than 1 − 2α. For example, if scores were independent between thresholds, then the probability would be (1 − 2α)^{|S|}, where |S| is the number of thresholds.
Simultaneous confidence limits can be obtained using a bootstrap method described by Davison and Hinkley (1997, section 4.2.4). Suppose that equitailed confidence intervals at each threshold u are based on a bootstrap sample {T*_i(u) : i = 1, ..., r} and have the form

L(u) = \hat{B}(u) - \hat{\sigma}(u) T^{*(r+1-k)}(u),
U(u) = \hat{B}(u) - \hat{\sigma}(u) T^{*(k)}(u)

for some 1 ≤ k ≤ r/2. If we use these limits to form simultaneous intervals, then the bootstrap estimate of the coverage probability [(10)] is

\frac{1}{r} \sum_{i=1}^{r} I\left( T^{*(k)}(u) \leq T^*_i(u) \leq T^{*(r+1-k)}(u) \text{ for all } u \in S \right).

It is sufficient, therefore, to choose k such that this estimate is as close as possible to 1 − 2α.
The resampling must preserve dependence between thresholds: the statistics {T*_i(u) : u ∈ S} should be computed from the same data for each i. So, resampling schemes take the following form.

1) Resample (X*_s, Y*_{s,1}, ..., Y*_{s,m}) from {(X_t, Y_{t,1}, ..., Y_{t,m}) : t = 1, ..., n} for each s = 1, ..., n.
2) Compute T*_i(u) for all u ∈ S.

The size of the resample may also need to be larger for simultaneous intervals. If scores were independent across thresholds, the worst case, then the bootstrap estimate of the coverage would be approximately (1 − 2k/r)^{|S|}. If this is to equal 1 − 2α, then we require r = 2k/[1 − (1 − 2α)^{1/|S|}] ≈ k|S|/α for large |S|. If α = 0.05 and we want k ≥ 5, for example, then r ≥ 100|S|.
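Given the matrix of bootstrap statistics T*_i(u), one threshold per column and one resample per row, the choice of k can be automated as in the sketch below (illustrative Python; the early exit relies on the coverage estimate being non-increasing in k).

```python
import numpy as np

def simultaneous_k(t_star, alpha=0.05):
    """Order-statistic index k whose bootstrap estimate of the simultaneous
    coverage probability (Eq. 10) is closest to 1 - 2*alpha.

    t_star : (r, S) array of statistics T*_i(u), with the columns computed
             from the same resamples so that dependence between thresholds
             is preserved.
    """
    r, _ = t_star.shape
    t_sorted = np.sort(t_star, axis=0)
    best_k, best_gap = 1, np.inf
    for k in range(1, r // 2 + 1):
        low = t_sorted[k - 1]                   # T*_(k)(u) for each threshold
        high = t_sorted[r - k]                  # T*_(r+1-k)(u)
        inside = np.all((t_star >= low) & (t_star <= high), axis=1)
        gap = abs(inside.mean() - (1.0 - 2.0 * alpha))
        if gap < best_gap:
            best_k, best_gap = k, gap
        if inside.mean() < 1.0 - 2.0 * alpha:   # coverage only falls as k grows
            break
    return best_k
```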
The dark gray bands in Fig. 6 are bootstrapped, simultaneous 90% confidence intervals for B_m for the ECMWF and MF rainfall forecasts. Considering all thresholds together, then, we find that, at the 10% level, neither the ECMWF nor MF forecasts differ significantly from climatology. The evidence for a difference between the ECMWF and MF forecasts is marginal at the 10% level.
7. Discussion
This article identified the effect of ensemble size on the expected Brier score [(7)] and, given ensembles of size m, an unbiased estimator [(8)] for the expected Brier score that would be obtained for any other ensemble size.
[Fig. 6. (top) Brier scores B̂_{m,n} (solid) for the ECMWF forecasts, with bootstrapped simultaneous 90% confidence intervals for B_m (dark gray) and B_{m,n} (light gray) at each threshold. Expected Brier scores are also shown for random forecasts (dotted) and climatology (dashed). (middle) As in the top panel but for the MF forecasts. (bottom) The difference (solid) in B̂_{m,n} between the ECMWF and MF forecasts, with bootstrapped simultaneous 90% confidence intervals for the differences between B_m (dark gray) and B_{m,n} (light gray).]
We assumed that ensemble members were exchangeable, an acceptable assumption when the forecast lead time is long enough for systematic differences between members to be lost. We proposed standard errors and confidence intervals for the expected Brier scores and found that bootstrap intervals performed well in a simulation study. When comparing the Brier scores from two forecasting systems, we proposed comparing estimates of expected Brier scores that would be obtained were the ensemble sizes equal, and described confidence intervals for their difference. We showed that if the Brier scores for several event definitions are of interest, then it is possible to construct confidence intervals that simultaneously contain with a specified probability the expected scores for all events.

We applied our methods to two sets of rainfall forecasts. For forecasting low rainfall totals, MF forecasts had lower Brier scores than ECMWF forecasts, even after estimating the effect of increasing the ECMWF ensemble size to infinity. Standard errors and confidence intervals suggested that the scores were only marginally significantly different at the 10% level for a few thresholds, and neither set of forecasts performed better than forecasting climatology.
Müller et al. (2005) have aims similar to ours but for the more general quadratic ranked probability score (RPS). They note that the expected RPS for perfectly calibrated but random ensemble forecasts exceeds the RPS obtained by forecasting climatology, which is equivalent to a perfectly calibrated random ensemble forecast with infinite ensemble size. This is analogous to B_m exceeding B_∞. Instead of using climatology as the reference forecast in RPS skill scores, they therefore propose using a perfectly calibrated random ensemble forecast with an ensemble size equal to that of the forecasts being assessed. This is equivalent to our proposal of comparing B̂_{m,n} with B_m instead of B_∞.
Müller et al. (2005) also produce confidence bands representing the sampling variation in the RPS skill score for random forecasts that arises among different observation–forecast datasets. Comparing a forecast system’s skill score with these bands provides a guide to its statistical significance relative to a random forecast, but does not provide a formal statistical test because the sampling variation in the system’s skill score is ignored. Our confidence intervals differ substantially: they are confidence intervals for the expected score of the forecast system being employed and can, therefore, be used to make comparisons with the expected score of any reference forecast, not only random forecasts, and can also be used to compare the expected scores of two forecasting systems.
The methods presented in this article can be extended in several ways. We have defined events as exceedances of thresholds for simplicity, but the same methods could be applied for events defined by membership of general sets. We have also considered scalar observations and forecasts for simplicity, but multivariate data can be handled with the same methods; for example, events could be defined by membership of multidimensional sets. The methods presented here can also be extended to multiple-category Brier scores (Brier 1950) and to the RPS. Computer code for the procedures presented in this article and written in the statistical programming language R is available from the author.
Acknowledgments. Conversations with Professor I. T. Jolliffe and Drs. C. A. S. Coelho, D. B. Stephenson, and G. J. van Oldenborgh (who provided the data), plus comments from the referees, helped to motivate and improve this work.
APPENDIX A
Conditional Brier Scores
a. Unbiased estimators
We discussed the expected Brier score [(1)] in the main part of the paper, where the expectation was taken over repeated sampling of forecasts and observations. We investigate the conditional expected Brier score in this appendix, where the expectation is taken over repeated sampling of forecasts, but where the observations remain fixed. This quantity would be of interest if we wanted to know how a forecasting system would have performed on a particular set of target observations for different ensemble sizes. As before, we shall find the effects of ensemble size and construct unbiased estimators, standard errors, and confidence intervals.

The only source of variation in the conditional case is the generation of ensemble members: each observation X_t, and the corresponding model details Z_t, are fixed. Consequently, we no longer need to assume stationarity, and the conditional expected Brier score is

B_{m,n} = \frac{1}{n} \sum_t E[(\hat{Q}_t - I_t)^2 \mid X_t, Z_t]
        = \frac{1}{n} \sum_t [E(\hat{Q}_t^2 \mid Z_t) - 2E(\hat{Q}_t \mid Z_t) I_t + I_t]

since X_t determines I_t. To see the effects of ensemble size, we assume again that the forecasts Q̂_t are simple
proportions [(3)] and that ensemble members are exchangeable. Then, for each t,

E(\hat{Q}_t \mid Z_t) = Q_t \quad \text{and} \quad E(\hat{Q}_t^2 \mid Z_t) = \frac{Q_t}{m} + \frac{m-1}{m} R_t

for some Q_t and R_t independent of m, and

B_{m,n} = \frac{1}{n} \sum_t \left[ R_t - 2 I_t Q_t + I_t + \frac{1}{m} (Q_t - R_t) \right]
        = B_{\infty,n} + \frac{1}{mn} \sum_t (Q_t - R_t),

where B_{∞,n} is the conditional counterpart of B_∞. The adjusted Brier score [(8)] is again unbiased for B_{M,n}.
b. Standard errors
Estimating the uncertainty about B_{M,n} due to sampling variation is harder than for B_M because we no longer assume stationarity. The contribution to the sampling variation must therefore be quantifiable for each ensemble separately. This is easier if we strengthen our assumption of exchangeability to one of independent and identically distributed members. This assumption is difficult to test empirically for ensemble forecasts with a distribution that changes through time and requires further investigation (cf. Bergsma 2004), so we appeal to the long lead time of our rainfall forecasts for justification. In this case, the number K_t of members that forecast the event in the ensemble at time t has a binomial distribution with mean mQ_t and variance mQ_t(1 − Q_t). After some lengthy algebra, the conditional variance, σ²_{M,n}, of B̂_{M,n} can be shown to satisfy
\frac{Mmn\,\sigma^2_{M,n}}{M-1} = \frac{2(3-2m)(1 - 1/M)}{n(m-1)} \sum_t Q_t^4
  + \frac{4[\,m - 2 + (3-2m)/M\,]}{n(m-1)} \sum_t Q_t^3
  + \frac{2[\,1 + (2m-3)/(M-1) + (7-5m)/\{2M(M-1)\}\,]}{n(m-1)} \sum_t Q_t^2
  + \frac{1}{nM(M-1)} \sum_t Q_t
  + \frac{8}{n} \sum_t I_t Q_t^3 - \frac{12}{n} \sum_t I_t Q_t^2 + \frac{4}{n} \sum_t I_t Q_t.
This variance decreases as m^{-1} for large m, so B̂_{M,n} is consistent for B_{M,n} as m → ∞. An unbiased estimator for σ²_{M,n} can be constructed if m > 3 by replacing each Q_t^s in the previous equation with

\frac{K_t (K_t - 1) \cdots (K_t - s + 1)}{m (m - 1) \cdots (m - s + 1)}

for positive integers s. The square root, σ̃_{M,n}, of this unbiased estimator is then an estimator for the conditional standard error of B̂_{M,n}. If m ≤ 3, then we can replace Q_t^s with (K_t/m)^s instead, but note that if m = 1, then σ̃_{M,n} is always zero.
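The substitution is easy to apply in practice; the sketch below (illustrative Python, hypothetical function name) returns the unbiased estimate of Q_t^s from the exceedance count K_t whenever m ≥ s.

```python
import numpy as np

def q_power_unbiased(k_t, m, s):
    """Unbiased estimate of Q_t**s from K_t ~ Binomial(m, Q_t), valid for
    m >= s: K(K-1)...(K-s+1) / [m(m-1)...(m-s+1)]."""
    k_t = np.asarray(k_t, dtype=float)
    num = np.ones_like(k_t)
    den = 1.0
    for j in range(s):
        num *= k_t - j
        den *= m - j
    return num / den

# With m = 9 members and K_t = 6 exceedances, the unbiased estimate of Q_t**2
# is 6*5/(9*8) = 0.4167 rather than the plug-in (6/9)**2 = 0.4444.
print(q_power_unbiased([6], m=9, s=2))
```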
Estimates of these conditional standard errors are shown in Fig. 2 for the ECMWF and MF rainfall forecasts. As expected, they are smaller than their unconditional counterparts, which reflect the additional variation from sampling observations. In fact, the conditional standard errors are small enough to suggest that the superiority of the MF forecasts at thresholds below 90 mm is statistically significant and would remain for these particular observations even if different ensemble members were sampled.
c. Confidence intervals
A normal (1 − 2α) confidence interval for B_{M,n} is

\hat{B}_{M,n} \pm \tilde{\sigma}_{M,n} z_\alpha.

Bootstrap intervals approximate the distribution of

T_{M,n} = \frac{\hat{B}_{M,n} - B_{M,n}}{\tilde{\sigma}_{M,n}}

by a bootstrap sample {T*_{M,n,i} : i = 1, ..., r}. This sample can be formed by repeating the following steps for each i = 1, ..., r.

1) Resample Y*_{t,j} from {Y_{t,i} : i = 1, ..., m} for each j = 1, ..., m, and repeat for each t = 1, ..., n.
2) Form \hat{B}^*_{M,n} and \tilde{\sigma}^*_{M,n} from these resampled ensembles in the same way that the original ensembles were used to form \hat{B}_{M,n} and \tilde{\sigma}_{M,n}.
3) Set T^*_{M,n,i} = (\hat{B}^*_{M,n} - B^*_{M,n}) / \tilde{\sigma}^*_{M,n}, where

B^*_{M,n} = \frac{1}{n} \sum_t \left[ \hat{Q}_t^2 - 2 I_t \hat{Q}_t + I_t + \frac{1}{M} \hat{Q}_t (1 - \hat{Q}_t) \right].

Bootstrap (1 − 2α) confidence limits then take the form

\hat{B}_{M,n} - \tilde{\sigma}_{M,n} T^{*(r+1-k)}_{M,n} \quad \text{and} \quad \hat{B}_{M,n} - \tilde{\sigma}_{M,n} T^{*(k)}_{M,n}.
Bootstrapped 90% confidence intervals for B_{m,n} are illustrated for the ECMWF and MF forecasts in Fig. 3. Again, the intervals are narrower than those for B_m. The ECMWF forecasts are now significantly worse than climatology for many thresholds; that is, they are unlikely to do as well as climatology for these observations were new ensemble members to be sampled, but are significantly better than random forecasts except between the 30% and 50% quantiles (50–90 mm).
The MF forecasts are not significantly different from climatology at most thresholds, and are significantly better than random forecasts at all thresholds.
d. Simulation study
The simulation study of section 4 was repeated for B_{m,n}. Results are not shown but were qualitatively similar to those reported in section 4 for B_m except for rare events (p = 0.9). In that case, bootstrap intervals remain preferable to normal intervals except when ρ = 0, for which both intervals have large coverage errors.
e. Comparing Brier scores
Confidence intervals for the difference, B_{M,n} − B′_{M,n}, between the conditional expected Brier scores of two systems are easy to construct if the forecasts from the two systems at any time t can be considered independent once the model details Z_t and Z′_t are fixed. This assumption might be violated if the ensemble generation process causes pairing of members between the two systems, though any such dependence is likely to diminish with lead time. The distribution of
D_{M,n} = \frac{(\hat{B}_{M,n} - \hat{B}'_{M,n}) - (B_{M,n} - B'_{M,n})}{\tau_{M,n}},

where \tau^2_{M,n} = \tilde{\sigma}^2_{M,n} + \tilde{\sigma}'^2_{M,n}, can be approximated by a bootstrap sample of the quantity

D^*_{M,n} = \frac{(\hat{B}^*_{M,n} - \hat{B}'^*_{M,n}) - (B^*_{M,n} - B'^*_{M,n})}{\tau^*_{M,n}}

to obtain confidence limits

(\hat{B}_{M,n} - \hat{B}'_{M,n}) - \tau_{M,n} D^{*(r+1-k)}_{M,n} \quad \text{and} \quad (\hat{B}_{M,n} - \hat{B}'_{M,n}) - \tau_{M,n} D^{*(k)}_{M,n}.
Resampling follows the scheme described earlier in the section independently for each system.

Figure 3 (bottom panel) shows bootstrapped 90% confidence intervals for the difference between B_{m,n} for the ECMWF and MF forecasts. The MF score is significantly lower than the ECMWF score for most thresholds below 90 mm.
APPENDIX B
Hypothesis Tests
We used confidence intervals in section 5 to test null hypotheses of equal expected Brier scores. Using the normal confidence interval is equivalent to a z test [e.g., the test labeled S1 by Diebold and Mariano (1995)] and using the bootstrap interval is equivalent to a bootstrap test (e.g., Davison and Hinkley 1997, p. 171). Confidence intervals are useful for quantifying uncertainty even when no comparative test is attempted, but if comparison is the goal, then other test procedures might also be employed. Hypothesis tests such as the sign and signed-rank tests (Hamill 1999) test for differences between medians and are inappropriate for testing differences between Brier scores, which are sample means. Instead, we compare the powers of the z and bootstrap tests with those of a t test and a permutation test (Hamill 1999) in a simulation study.
The study design is similar to that in section 4, except that two sets of forecasts are simulated, one uncorrelated with the observations (ρ₁ = 0) while the correlation for the other set is varied from ρ₂ = 0 to ρ₂ = 1. The sets have the same expected Brier score when ρ₂ = 0 and the scores diverge as ρ₂ increases. For each value of ρ₂, 10 000 datasets are generated and subjected to the four tests at the 10% significance level. Figure B1 (left panel) shows Monte Carlo estimates of power for the four tests when m = 8, n = 20, and p = 0.5. All four tests have similar powers, although the z test is slightly oversized and the bootstrap test has slightly lower power far from the null hypothesis.
The z test in Diebold and Mariano (1995) is adapted to handle serial dependence, and block resampling can be used for the permutation and bootstrap tests. The power study is repeated with observations simulated from a first-order moving-average process with correlation 0.5 at lag one. Powers for these three tests are plotted in Fig. B1 (right panel) and show that the z test and, to a lesser extent, the bootstrap tests are oversized, while the permutation test has remained well sized and its power has reduced only slightly from the independent case.
[Fig. B1. Monte Carlo estimates of the powers of the bootstrap (solid), permutation (dashed), z (dotted), and t (dotted–dashed) tests against correlation ρ₂ (see text) for (left) serially independent and (right) dependent observations.]
From this limited study, the permutation test appears to be an attractive alternative to the bootstrap test for differences between Brier scores.
REFERENCES
Bergsma, W. P., 2004: Testing conditional independence for continuous random variables. EURANDOM Tech. Rep. 2004-048, 19 pp.
Bradley, A. A., T. Hashino, and S. S. Schwartz, 2003: Distributions-oriented verification of probability forecasts for small data samples. Wea. Forecasting, 18, 903–917.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.
Briggs, W., and D. Ruppert, 2005: Assessing the skill of yes/no predictions. Biometrics, 61, 799–807.
Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. Mon. Wea. Rev., 126, 2503–2518.
Chatfield, C., 2004: The Analysis of Time Series: An Introduction. Chapman and Hall, 333 pp.
Davison, A. C., and D. V. Hinkley, 1997: Bootstrap Methods and Their Application. Cambridge University Press, 592 pp.
DiCiccio, T. J., A. C. Monti, and G. A. Young, 2006: Variance stabilization for a scalar parameter. J. Roy. Stat. Soc., 68B, 281–303.
Diebold, F. X., and R. S. Mariano, 1995: Comparing predictive accuracy. J. Bus. Econ. Stat., 13, 253–263.
Hall, P., 1987: On the bootstrap and continuity correction. J. Roy. Stat. Soc., 49B, 82–89.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167.
Jolliffe, I. T., 2007: Uncertainty and inference for verification measures. Wea. Forecasting, 22, 637–650.
——, and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 240 pp.
Kane, T. L., and B. G. Brown, 2000: Confidence intervals for some verification measures–a survey of several methods. Preprints, 15th Conf. on Probability and Statistics, Asheville, NC, Amer. Meteor. Soc., 46–49.
Lahiri, S. N., 1993: Bootstrapping the Studentized sample mean of lattice variables. J. Mult. Anal., 45, 247–256.
Müller, W. A., C. Appenzeller, F. J. Doblas-Reyes, and M. A. Liniger, 2005: A debiased ranked probability skill score to evaluate probabilistic ensemble forecasts with small ensemble sizes. J. Climate, 18, 1513–1523.
Palmer, T. N., and Coauthors, 2004: Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER). Bull. Amer. Meteor. Soc., 85, 853–872.
Potts, J. M., 2003: Basic concepts. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., John Wiley and Sons, 13–36.
Richardson, D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Quart. J. Roy. Meteor. Soc., 127, 2473–2489.
Romano, J. P., 1988: A bootstrap revival of some nonparametric distance tests. J. Amer. Stat. Assoc., 83, 698–708.
Seaman, R., I. Mason, and F. Woodcock, 1996: Confidence intervals for some performance measures of yes-no forecasts. Aust. Meteor. Mag., 45, 49–53.
Severini, T. A., 2005: Elements of Distribution Theory. Cambridge University Press, 515 pp.
Stephenson, D. B., 2000: Use of the “odds ratio” for diagnosing forecast skill. Wea. Forecasting, 15, 221–232.
Thornes, J. E., and D. B. Stephenson, 2001: How to judge the quality and value of weather forecast products. Meteor. Appl., 8, 307–314.
Wilks, D. S., 1997: Resampling hypothesis tests for autocorrelated fields. J. Climate, 10, 65–82.
——, 2006: Statistical Methods in the Atmospheric Sciences. 2d ed. Academic Press, 627 pp.
Woodcock, F., 1976: The evaluation of yes/no forecasts for scientific and administrative purposes. Mon. Wea. Rev., 104, 1209–1214.