Comparing Probabilistic Forecasting Systems with the Brier Score
CHRISTOPHER A. T. FERRO
School of Engineering, Computing and Mathematics, University of
Exeter, Exeter, United Kingdom
(Manuscript received 13 April 2006, in final form 16 January
2007)
ABSTRACT
This article considers the Brier score for verifying ensemble-based probabilistic forecasts of binary events. New estimators for the effect of ensemble size on the expected Brier score, and associated confidence intervals, are proposed. An example with precipitation forecasts illustrates how these estimates support comparisons of the performances of competing forecasting systems with possibly different ensemble sizes.
1. Introduction
Verification scores are commonly used as numerical summaries for the quality of weather forecasts. General introductions to forecast verification are given by Jolliffe and Stephenson (2003) and Wilks (2006, chapter 7). There are many situations in which we may wish to compare the values of a verification score computed for two sets of forecasts: to compare the quality of forecasts from a single forecasting system at different times or locations, or in different meteorological conditions, or to compare the quality of forecasts from two forecasting systems. Several authors have recommended that measures of uncertainty for the scores, such as standard errors or confidence intervals, should be computed to aid such comparisons. Woodcock (1976), Seaman et al. (1996), Kane and Brown (2000), Stephenson (2000), Thornes and Stephenson (2001), and Wilks (2006, section 7.9) propose confidence intervals for scores of deterministic binary-event forecasts. Bradley et al. (2003) use simulation to compare the biases and standard errors of different estimators for several scores of probabilistic binary-event forecasts, but do not discuss estimators for the standard errors. Hamill (1999) takes a different approach and proposes hypothesis tests for comparing the scores of two sets of deterministic or probabilistic forecasts; see also Briggs and Ruppert (2005). Jolliffe (2007) reviews this work and related ideas, and also presents confidence intervals for the correlation coefficient.
Woodcock (1976) explains the motivation for these attempts to quantify uncertainty. The value of a score depends on the choice of target observations, so the superiority of one forecasting system over another as gauged by their verification scores computed for only finite samples of forecasts and observations cannot be definitive: the superiority may be reduced or even reversed were the systems applied to new target observations. The methods listed previously estimate the variation that would arise in the value of a verification score were forecasts made for different sets of potential target observations, and thereby quantify the uncertainty about some “true” value that would be known were forecasts available for all potential observations. We shall consider uncertainty in expected values of the Brier verification score (Brier 1950) in the case of ensemble-based probabilistic binary-event forecasts. Unbiased estimators for the expected Brier score that would be obtained for any ensemble size are defined in section 2. Standard errors and confidence intervals are developed in section 3, and their performance is assessed with a simulation study in section 4. Methods for comparing the Brier scores of two forecasting systems are presented in section 5. Confidence intervals appropriate for comparing Brier scores of two systems simultaneously for multiple events and sites are described in section 6.
The methods are illustrated throughout the paper with seasonal precipitation forecasts from the Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER) project
(Palmer et al. 2004). In particular, 3-month-ahead, nine-member ensemble forecasts of total October precipitation produced by the European Centre for Medium-Range Weather Forecasts (ECMWF) and Météo-France (MF) global atmosphere–ocean coupled models are compared with observations recorded at Jakarta (6.17°S, 106.82°E) for the years 1958–95. Data are missing for 1983, leaving 37 yr. The ensembles were generated by sampling independent perturbations of the initial ocean state. The Jakarta observations and the forecasts from the corresponding grid box are shown in Fig. 1.
2. Expected Brier scores
a. Definitions
We define the Brier score together with notation that will be used throughout the rest of the paper. Let {X_t : t = 1, ..., n} be a set of n observations, and let {(Y_{t,1}, ..., Y_{t,m}) : t = 1, ..., n} be a corresponding set of m-member ensemble forecasts. For each time t, suppose that we issue a probabilistic forecast, Q̂_t, for the event that observation X_t exceeds a threshold u. The Brier score for this set of forecasts, equal to one-half of the score originally proposed by Brier (1950), is the mean squared difference between the forecasts and the indicator variables for the event; that is,

\hat{B}_{m,n} = \frac{1}{n} \sum_{t=1}^{n} (\hat{Q}_t - I_t)^2,

where I_t = I(X_t > u), I(A) = 1 if A is true, and I(A) = 0 if A is false. All summations will be over t = 1, ..., n unless otherwise specified.
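To make the definition concrete, the following minimal sketch (in Python with NumPy rather than the author's R code; the function and variable names are illustrative) evaluates B̂_{m,n} when the forecasts are the proportions of ensemble members exceeding a threshold, as in the forecast definition adopted in section 2b.

```python
import numpy as np

def brier_score(x, y, u, v=None):
    """Brier score for ensemble forecasts of the event X_t > u.

    x : (n,) array of observations X_t
    y : (n, m) array of ensemble members Y_{t,i}
    u : observation threshold defining the event indicators I_t
    v : ensemble threshold defining the forecasts (defaults to u)
    """
    v = u if v is None else v
    q_hat = (y > v).mean(axis=1)         # forecast probabilities Q_hat_t
    i_t = (x > u).astype(float)          # event indicators I_t
    return np.mean((q_hat - i_t) ** 2)   # mean squared difference

# Synthetic example (not the DEMETER data): 37 yr, nine members.
rng = np.random.default_rng(1)
x = rng.normal(size=37)
y = 0.4 * x[:, None] + np.sqrt(1 - 0.4 ** 2) * rng.normal(size=(37, 9))
print(brier_score(x, y, u=float(np.quantile(x, 0.5))))
```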
We mentioned in the previous section that the variation in verification scores caused by the choice of target observations leads to uncertainty about the quality of the forecasting system. One quantity of interest that we may be uncertain about is the expected Brier score, denoted

B_m = E(\hat{B}_{m,n}),     (1)

and defined as the average Brier score over repeated samples from a population of observations and forecasts. This population can be defined implicitly by assuming that the available sample of observations and forecasts is in some sense representative of the larger population. We assume that the population is a stationary sequence of which our data form a partial realization. This is likely to be a good approximation in a stable climate and could be generalized for a changing climate by assuming, for example, that the data are a partial realization of a nonstationary, multivariate time series model chosen to represent climatic change. We shall concentrate on B_m, but other quantities, such as the conditional expected Brier score discussed in appendix A, may also be of interest.
b. The effects of ensemble size
We investigate how Bm depends on the ensemble sizem. By
stationarity, the expectation of (Q̂t � It)
2 is thesame for all t, so we can write
Bm � E ��Q̂ � I�2�,
where Q̂ and I are the forecast and the event indicatorfor an
arbitrary time. This expectation is an averageover all possible
values of the observation variable Xand the ensemble Y � (Y1, . . .
, Ym). Now let Z denotesufficient information about the forecasting
model todetermine a probability distribution for Y, given whichY is
independent of X. This information might be themodel specification
plus a probability distribution forits initial conditions, for
example. The law of total ex-pectation (e.g., Severini 2005, p. 55)
says that we canobtain Bm in two stages: first by taking the
expectationwith respect to X and Y when Z is held fixed, and
thenaveraging over Z. This is written
Bm � EE ��Q̂ � I�2 |Z�.
[Fig. 1. Observed (vertical lines) October rainfall (mm) in Jakarta from 1958 to 1995, plotted between both the ECMWF (filled circles) and MF (open circles) nine-member ensemble forecasts.]
The interior, conditional expectation is

E[(\hat{Q} - I)^2 \mid Z] = E(\hat{Q}^2 \mid Z) - 2E(\hat{Q}I \mid Z) + E(I \mid Z)
                         = E(\hat{Q}^2 \mid Z) - 2E(\hat{Q} \mid Z)P + P,     (2)

where P = E(I | Z) = Pr(X > u | Z) is the probability with which the event occurs.

We must specify how Q̂ is formed from the ensemble members in order to reveal the effects of ensemble size. We choose forecasts equal to the proportion of members that exceed a threshold v, possibly different from u; that is,

\hat{Q} = \frac{K}{m} = \frac{1}{m} \sum_{i=1}^{m} I(Y_i > v),     (3)

where K is the number of members exceeding v. Alternative forecasts could be considered, for example, (K + a)/(m + b) with b ≥ a ≥ 0, although these lead to more complicated formulas later.
For simplicity, we assume that the members within any particular ensemble are exchangeable. Exchangeability means that the members are indistinguishable by their statistical properties: their joint distribution function is invariant to relabeling the members. This admits homogeneous dependence between members and includes the special case of independent and identically distributed members.

Exchangeability implies that all members of an ensemble exceed v with the same probability,

\Pr(Y_i > v \mid Z) = Q \quad \text{for all } i,

and all pairs of members jointly exceed v with the same probability,

\Pr(Y_i > v, Y_j > v \mid Z) = R \quad \text{for all } i \neq j.

Taken together with the forecast definition [(3)], we have
E(\hat{Q} \mid Z) = \frac{1}{m} \sum_{i=1}^{m} \Pr(Y_i > v \mid Z) = Q,     (4)

E(\hat{Q}^2 \mid Z) = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \Pr(Y_i > v, Y_j > v \mid Z) = \frac{Q}{m} + \frac{m-1}{m} R,     (5)

and the conditional expectation [(2)] equals

R - 2PQ + P + \frac{1}{m}(Q - R).

Finally, we take the expectation with respect to Z to obtain

B_m = E(R) - 2E(PQ) + E(P) + \frac{1}{m} E(Q - R).     (6)

Because P, Q, and R are independent of m, this expression describes completely the effects of ensemble size. Moreover, the final term is non-negative because R ≤ Q. As the ensemble size increases, B_m therefore decreases monotonically to the expected Brier score, B_∞, that would be obtained for an infinite ensemble size, and we can write

B_M = B_\infty + \frac{1}{M} E(Q - R),     (7)

where B_M is the expected Brier score that would be obtained for an ensemble of size M. This generalizes the relationship found by Richardson [2001, Eq. (9)] for independent ensemble members, in which case R = Q².
c. Unbiased estimators
The Brier score B̂_{m,n} is an unbiased estimator for B_m by definition [(1)] but is biased for B_M when M ≠ m. Estimating B_M from ensembles of size m is useful for comparing forecasting systems with different ensemble sizes or for assessing the potential benefit of larger ensembles (cf. Buizza and Palmer 1998). Equations (4) and (5) can be used to show that an unbiased estimator for B_M is

\hat{B}_{M,n} = \hat{B}_{m,n} - \frac{M - m}{M(m-1)n} \sum_t \hat{Q}_t (1 - \hat{Q}_t),     (8)

and letting M → ∞ yields an unbiased estimator for B_∞.
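The estimator (8) is simple to compute; the sketch below (an illustrative Python implementation, not the author's code, with hypothetical names) returns B̂_{M,n} for any target ensemble size M, including M = ∞ for B̂_{∞,n}.

```python
import numpy as np

def adjusted_brier_score(x, y, u, M, v=None):
    """Unbiased estimator of the expected Brier score for ensemble size M,
    Eq. (8), from ensembles of size m >= 2; use M = np.inf for B_infinity."""
    v = u if v is None else v
    n, m = y.shape
    q_hat = (y > v).mean(axis=1)
    i_t = (x > u).astype(float)
    b_mn = np.mean((q_hat - i_t) ** 2)        # Brier score for ensemble size m
    factor = (1.0 - m / M) / (m - 1.0)        # equals (M - m) / [M (m - 1)]
    return b_mn - factor * np.mean(q_hat * (1.0 - q_hat))
```

With M = np.inf the factor reduces to 1/(m − 1), giving the unbiased estimator of B_∞ mentioned above.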
1) REMARK 1
The new estimator [(8)] is undefined if m = 1, in which case an unbiased estimator for B_M (M > 1) does not exist because the forecasts contain no information about the effects of ensemble size. Mathematically, for any function h(K, I) independent of R,

E[h(K, I) \mid Z] = h(0, I)(1 - Q) + h(1, I)Q

cannot contain the required R term. Richardson (2001) does, however, develop a method for estimating a skill score based on B_M given an ensemble of any size, even m = 1. He achieves this by assuming independent ensemble members (R = Q²) and perfect reliability (Q = P), in which case the expression [(6)] for B_m becomes

B_m = \frac{m+1}{m} E[Q(1 - Q)],
and so
B_M = \frac{m(M+1)}{M(m+1)} B_m.     (9)

In this special case, an unbiased estimator can therefore be obtained for B_M even when m = 1 by substituting B̂_{m,n} for B_m in the right-hand side of (9).
2) REMARK 2
The adjustment term in the definition [(8)] of B̂_{M,n} depends on only the forecasts and is a measure of sharpness (e.g., Potts 2003). Let

S = \frac{1}{n} \sum_t \left(\hat{Q}_t - \tfrac{1}{2}\right)^2

be the sample variance of the forecasts around one-half: as S decreases from its maximum value (1/4) to its minimum value (0), forecasts become more concentrated around one-half and the sharpness decreases. Now,

\hat{B}_{M,n} = \hat{B}_{m,n} - \frac{M - m}{M(m-1)} \left(\tfrac{1}{4} - S\right),

so B̂_{M,n} reduces the Brier score by amounts that depend on the estimated sharpness and the ensemble size, m. For fixed sharpness, the improvement in forecast quality from increasing the ensemble size decreases as m increases: the law of diminishing returns. For fixed m, the improvement decreases as the sharpness increases, suggesting that the improvement may be attributed to the opportunity to shift forecasts slightly farther away from one-half.
3) REMARK 3
The Brier score B̂_{m,n} is proper (e.g., Wilks 2006, p. 298) because, if our belief in the occurrence of the event {I_t = 1} equals p ∈ [0, 1], then the expected contribution to the Brier score with respect to this belief from issuing forecast Q̂_t, that is,

E[(\hat{Q}_t - I_t)^2] = \hat{Q}_t^2 (1 - p) + (\hat{Q}_t - 1)^2 p,

is minimized by choosing Q̂_t = p. Similar calculations show that B̂_{M,n} is improper when M ≠ m because the optimum forecast is then

p + \frac{M - m}{m(M-1)} \left(\tfrac{1}{2} - p\right).

Therefore, B̂_{M,n} should not be used in situations where it could be hedged.
d. Exchangeability
We assumed that ensemble members were exchangeable and independent of observations given suitable information, Z. The latter assumption is hard to contest because Z can include the full specification of the forecasting model and its inputs. Exchangeability is more restrictive and would be violated were one member biased relative to the other members, for example, or were one pair of members more strongly correlated than other pairs.

Exchangeability might be justified by the process generating the ensemble. For example, exchangeability will hold if the initial conditions for the members are randomly sampled from a probability distribution. Exchangeability is also likely to hold if the forecast lead time is long enough for any initial ordering or dependence between the members to be lost. This latter argument seems appropriate for our 3-month-ahead rainfall forecasts.
Exchangeability might also be justified by empirical assessment. Romano (1988) describes a bootstrap test for exchangeability based on the maximum distance between the empirical distribution functions of the members and permuted members. Applying this test for the ECMWF and MF ensemble forecasts of Jakarta rainfall gave p values of 0.26 and 0.24, which is only weak evidence for rejecting exchangeability.

The effect of ensemble size on the expected Brier score can still be estimated even when exchangeability is unjustifiable. If we wish to estimate B_M for a subset of M < m members, then an unbiased estimator is simply the Brier score evaluated for the forecasts constructed from those M members. This approach is straightforward to implement for any verification score, but is inapplicable when M > m.
e. Data example
We estimate the expected Brier scores B_m and B_∞ for the ECMWF and MF rainfall forecasts at a range of event thresholds u and v. These are shown in Fig. 2, where we set v = u and let u range from the 10% to the 90% quantiles of the observed rainfall. The MF forecasts appear to have significantly lower Brier scores than do those of the ECMWF for thresholds below 90 mm (about the median observed rainfall), and the two systems have similar Brier scores at higher thresholds. The estimated difference between B_m for MF and B_∞ for ECMWF is also large below 90 mm, suggesting that increasing the ECMWF ensemble size would not be sufficient to match the MF Brier score.
3. Sampling variation
a. Standard errors
Point estimates of expected Brier scores were presented in the previous section. In this section, we estimate the uncertainty associated with these estimates due to sampling variation. In particular, we shall estimate standard errors and construct confidence intervals for the expected scores. We assume only that the data are stationary; we no longer need to assume exchangeability or a particular form [(3)] for the forecasts. We shall consider only B̂_{M,n}, the estimator [(8)] for B_M, because other estimators can be obtained as special cases by changing M.

We can write B̂_{M,n} as the sample mean of the summands

W_t = (\hat{Q}_t - I_t)^2 - \frac{M - m}{M(m-1)} \hat{Q}_t (1 - \hat{Q}_t).
If the interval between successive times t is large, then we may be justified in making the further assumption that these summands are independent, in which case the standard error of B̂_{M,n} is estimated by

\hat{\sigma}_{M,n} = \sqrt{\frac{\sum_t (W_t - \hat{B}_{M,n})^2}{n(n-1)}}.

If the summands are dependent, then estimates of serial correlation may be incorporated into the standard error (e.g., Wilks 1997).
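Assuming serially independent summands, the estimate and its standard error can be computed together; the sketch below (Python, illustrative names) also returns the summands W_t so that they can be fed to the bootstrap of section 3b.

```python
import numpy as np

def summands_and_se(x, y, u, M, v=None):
    """Summands W_t of B_hat_{M,n}, the point estimate, and the standard
    error sigma_hat_{M,n} under serial independence."""
    v = u if v is None else v
    n, m = y.shape
    q_hat = (y > v).mean(axis=1)
    i_t = (x > u).astype(float)
    w = (q_hat - i_t) ** 2 - (1.0 - m / M) / (m - 1.0) * q_hat * (1.0 - q_hat)
    b_hat = w.mean()
    se = np.sqrt(np.sum((w - b_hat) ** 2) / (n * (n - 1)))
    return w, b_hat, se
```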
There is little evidence for serial dependence in the summands of the Brier scores for our ECMWF and MF rainfall forecasts. For example, a two-sided test for the lag-one autocorrelation (e.g., Chatfield 2004, p. 56) was conducted for both the ECMWF and MF data at each of nine thresholds v = u ranging from the 10% to the 90% quantiles of the observed rainfall, and only one p value was smaller than 0.1. We assume serial independence hereafter. Standard errors for the estimates of B_m and B_∞ are shown in Fig. 2 and are large enough to call into question the statistical significance of the differences noted previously in the quality of the ECMWF and MF forecasts. These differences are assessed more formally in section 5.
b. Confidence intervals
More informative descriptions of uncertainty are afforded by confidence intervals, which we now construct. Unless the summands of B̂_{M,n} exhibit long-range dependence, we can expect a central limit theorem to hold and imply that B̂_{M,n} is approximately normally distributed when n is large. An approximate (1 − 2α) confidence interval for B_M would then be

\hat{B}_{M,n} \pm \hat{\sigma}_{M,n} z_\alpha,

where z_α is the α quantile of the standard normal distribution.
An alternative approximation to the distribution of B̂_{M,n} is available via the bootstrap method (e.g., Davison and Hinkley 1997). To obtain confidence intervals for B_M, the distribution of the studentized statistic

T_n = \frac{\hat{B}_{M,n} - B_M}{\hat{\sigma}_{M,n}}

is approximated by a bootstrap sample {T*_{n,i} : i = 1, ..., r}. If summands are independent, then this sample can be formed by repeating the following steps for each i = 1, ..., r:

1) Resample W*_s uniformly and with replacement from {W_t : t = 1, ..., n} for each s = 1, ..., n.
2) Set \hat{B}^*_{M,n} = \sum_t W^*_t / n and

\hat{\sigma}^*_{M,n} = \sqrt{\frac{\sum_t (W^*_t - \hat{B}^*_{M,n})^2}{n(n-1)}}.

3) Set T^*_{n,i} = (\hat{B}^*_{M,n} - \hat{B}_{M,n}) / \hat{\sigma}^*_{M,n}.

Block bootstrapping (e.g., Wilks 1997) can be employed if the summands are serially dependent.
[Fig. 2. Estimates B̂_{m,n} (upper thick) and B̂_{∞,n} (lower thick) for the ECMWF (solid) and MF (dashed) forecasts of October rainfall at Jakarta exceeding different thresholds during the period 1958–95. Thresholds are marked as quantiles (lower axis) and absolute values (mm, upper axis) of the observed rainfall. Standard errors (upper thin) and their conditional versions (lower thin; see appendix A) are shown for the ECMWF (solid) and MF (dashed) forecasts, and are indistinguishable for B̂_{m,n} and B̂_{∞,n}.]
Bootstrap (1 − 2α) confidence intervals are then defined by the limits

\hat{B}_{M,n} - \hat{\sigma}_{M,n} T^{*(r+1-k)}_{n} \quad \text{and} \quad \hat{B}_{M,n} - \hat{\sigma}_{M,n} T^{*(k)}_{n},

where k = ⌊αr⌋ and T^{*(1)}_n ≤ ... ≤ T^{*(r)}_n are the order statistics of the bootstrap sample. Neither the normal nor the bootstrap confidence limits are guaranteed to fall in the interval [0, 1], so they will always hereafter be truncated at the end points.
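A direct translation of steps 1–3 and the limits above, under serial independence, might look like the following sketch (Python; the value of r, the names, and the lack of a guard against degenerate resamples with zero spread are all illustrative choices).

```python
import numpy as np

def studentized_bootstrap_ci(w, alpha=0.05, r=5000, rng=None):
    """Equitailed (1 - 2*alpha) studentized bootstrap interval for B_M
    from the summands W_t, truncated to [0, 1]; assumes alpha * r >= 1."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(w)
    b_hat = w.mean()
    se = np.sqrt(np.sum((w - b_hat) ** 2) / (n * (n - 1)))
    t_star = np.empty(r)
    for i in range(r):
        w_star = rng.choice(w, size=n, replace=True)                 # step 1
        b_star = w_star.mean()                                       # step 2
        se_star = np.sqrt(np.sum((w_star - b_star) ** 2) / (n * (n - 1)))
        t_star[i] = (b_star - b_hat) / se_star                       # step 3
    t_star.sort()
    k = int(np.floor(alpha * r))
    lower = b_hat - se * t_star[r - k]       # uses order statistic T*_(r+1-k)
    upper = b_hat - se * t_star[k - 1]       # uses order statistic T*_(k)
    return max(lower, 0.0), min(upper, 1.0)
```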
These confidence intervals can be used to test hypotheses of the form B_M = b, for some reference value b that represents minimal forecast quality. If a two-sided (1 − α) confidence interval for B_M does not contain b, then the hypothesis is rejected in favor of the two-sided alternative hypothesis B_M ≠ b at the 100α% level. One common reference value for B_m is the Brier score, q² + (1 − 2q)∑_t I_t / n, obtained if the same probability q is forecast at every time t. Another is the expected Brier score, (2m + 1)/(6m) or 1/3, obtained if the forecast at each time t is selected independently from a uniform distribution on either the set {i/m : i = 0, ..., m} or the interval [0, 1].
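For reference, the two benchmark values just described can be computed directly; this small sketch (illustrative Python) returns the constant-forecast score and the random-forecast expectation for an m-member ensemble.

```python
import numpy as np

def reference_scores(i_t, m, q=None):
    """Brier score of a constant forecast q (climatology when q = mean(I_t))
    and expected score of a forecast drawn uniformly from {0, 1/m, ..., 1}."""
    q = i_t.mean() if q is None else q
    constant = q ** 2 + (1 - 2 * q) * i_t.mean()
    random_m = (2 * m + 1) / (6 * m)          # tends to 1/3 as m -> infinity
    return constant, random_m
```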
The dark gray bands in the top two panels of Fig. 3 are bootstrapped 90% confidence intervals (using r = 5000) for B_m for the ECMWF and MF rainfall forecasts. The ECMWF forecasts are significantly worse than climatology (the constant forecast q = ∑_t I_t / n) at the 10% level for a few thresholds, but are significantly better than random forecasts except between the 30% and 70% quantiles (50–130 mm). The MF forecasts are not significantly different from climatology at any threshold, but are significantly better than random forecasts except between the 45% and 65% quantiles (70–110 mm).
4. Simulation study
a. Serial independence
We compare the performances of the proposed normal and bootstrap confidence intervals for B_m with a simulation study. The performance of an equitailed (1 − 2α) confidence interval is commonly assessed by its achieved coverage and average length in repeated simulated datasets for which the true value of B_m is known. Let B̂_i be the point estimate and let L_i and U_i be the lower and upper confidence limits computed from the ith of N datasets. The achieved lower and upper coverages are the proportions of times that B_m falls above and below the lower and upper limits; that is,

N^{-1} \sum_{i=1}^{N} I(B_m \geq L_i) \quad \text{and} \quad N^{-1} \sum_{i=1}^{N} I(B_m \leq U_i),

which should both equal 1 − α. The average length is the mean distance between the lower and upper limits; that is,

N^{-1} \sum_{i=1}^{N} (U_i - L_i),

which should be as small as possible.
which should be as small as possible.The performance of the
confidence intervals depends
on the ensemble size m, the sample size n, the thresh-olds u and
�, the target coverage defined by �, and thejoint distribution of
the observations and forecasts. Weexamine the effects of all of
these factors in this simu-lation study, although a complete
investigation is im-possible. Serially independent observations are
simu-lated from a standard normal distribution. Ensemblemembers are
also normally distributed, and each has acorrelation � with its
contemporary observation but isotherwise independent of the other
members. Forecastsare simple proportions [(3)] and we use
thresholds u �� equal to p quantiles of the standard normal
distribu-tion. We consider the following values for the
variousfactors: m � 2, 4, 8; n � 10, 20, 40; p � 0.5, 0.7, 0.9;� �
0, 0.4, 0.8; and � between 0.005 and 0.05. Results forp � 0.1 and
0.3 would be the same as for p � 0.9 and0.7, respectively, because
the former could be obtainedby redefining events as deficits below
thresholds, which
[Fig. 3. (top) Brier scores B̂_{m,n} (solid) for the ECMWF forecasts, with bootstrapped 90% confidence intervals for B_m (dark gray) and B_{m,n} (light gray; see appendix A) at each threshold. Expected Brier scores are also shown for random forecasts (dotted) and climatology (dashed). (middle) The same as in the top panel but for the MF forecasts. (bottom) The difference (solid) in B̂_{m,n} between the ECMWF and MF forecasts, with bootstrapped 90% confidence intervals for the differences between B_m (dark gray) and B_{m,n} (light gray).]
We use N = 10 000 datasets and r = 1000 bootstrap samples throughout.

We show results for m = 8 and n = 40 only; results are qualitatively similar for different values. Figure 4 shows the lower and upper coverage errors for the normal and bootstrap confidence intervals as α varies and with different values for ρ and p. Figure 5 shows the corresponding lengths. The coverage errors of the lower limits are usually positive (the lower limits are too low and overcover) while the upper errors are often negative (the upper limits are too low and undercover). The errors are always smaller for the bootstrap limits than for the normal limits. The bootstrap achieves this by shrinking the lower limits and extending the upper limits compared to the normal limits (not shown) to capture asymmetry in the sampling distribution of the Brier score, producing wider intervals for the bootstrap, as revealed by Fig. 5. The interval lengths decrease as both α and ρ increase.
b. Serial dependence
To investigate the sensitivity of the results to the presence of serial dependence, the simulations were repeated with observations generated from a first-order moving-average process with correlation 0.5 at lag one. The standard errors were adjusted for the lag-one correlation and the block bootstrap was employed with blocks of size two. Results (not shown) were qualitatively similar to those for serial independence, except that both positive and negative lower coverage errors were found. Errors were larger and intervals wider because of the smaller effective sample sizes.
c. Modified bootstrap intervals
The bootstrap coverage errors in Fig. 4 are typically less than α/2. Errors decrease as n increases, so these intervals may be acceptable for many applications. Improvements are desirable, however, particularly for small sample sizes and rare events. Several modifications have been explored by the author, specifically basic and percentile bootstrap intervals and bootstrap calibration (Davison and Hinkley 1997, chapter 5) and a continuity correction (Hall 1987) to account for the discrete nature of the summands of the Brier score. None of these methods improved significantly on the studentized intervals presented above.
[Fig. 4. Monte Carlo estimates of normal (solid) and bootstrap (dashed) lower (left) and upper (right) coverage errors plotted against α when m = 8; n = 40; ρ = 0 (thin), 0.4 (medium), and 0.8 (thick); and p = (top) 0.9, (middle) 0.7, and (bottom) 0.5. Solid horizontal lines mark zero error.]
A variance-stabilizing transformation proposed by DiCiccio et al. (2006) was also applied and found to reduce the coverage error in the lower limit, especially for rare events for which errors were approximately halved. The effect on the upper limits was small. A large part of the coverage error in small samples arises from the fact that the Brier score can take only a small number of distinct values. One way to reduce these errors is to smooth the Brier score by adding a small amount of random noise (Lahiri 1993). Investigations unreported here show that this can indeed reduce coverage errors significantly at the expense of widening the confidence intervals. However, results depend strongly on the amount of smoothing employed and making general recommendations is difficult. An alternative solution could be to fit joint probability distributions to the observations and forecasts before determining the forecast probabilities (Bradley et al. 2003). This would allow the forecasts, and hence the Brier score, to take any values on the interval [0, 1], and so avoid discretization errors. Another advantage would be the avoidance of intervals with zero length, which occurs for both the normal and bootstrap intervals described above when all summands of the Brier score are equal.
5. Comparing Brier scores
Consider the task of comparing the Brier scores of two forecasting systems, the first with ensemble size m and the second with ensemble size m′, verified against the same set of observations. Quantities pertaining to the second system will be distinguished with primes. We can compare the two systems by constructing a confidence interval for the difference, B_M − B′_M, between the Brier scores that would be expected were both ensemble sizes equal to M. Such a comparison may help to identify whether or not a perceived superiority of one system is due only to its larger ensemble size. If the (1 − α) confidence interval does not contain zero, then the null hypothesis of equal scores is rejected in favor of the two-sided alternative at the 100α% level. We estimate the difference between the Brier scores using unbiased estimators [(8)], though the subsampling approach described at the end of section 2d could also be used.
Normal confidence intervals are defined by

(\hat{B}_{M,n} - \hat{B}'_{M,n}) \pm \tau_{M,n} z_\alpha,

where, if there is no serial dependence,

\tau^2_{M,n} = \frac{1}{n(n-1)} \sum_t \left[ (W_t - W'_t) - (\hat{B}_{M,n} - \hat{B}'_{M,n}) \right]^2.

As in section 3, this can be adjusted to account for serial dependence. Bootstrap intervals approximate the distribution of

D_n = \frac{(\hat{B}_{M,n} - \hat{B}'_{M,n}) - (B_M - B'_M)}{\tau_{M,n}}

with a bootstrap sample {D*_{n,i} : i = 1, ..., r}. If the summands are serially independent, then this sample can be formed by repeating the following steps for each i = 1, ..., r.

1) Resample pairs (W*_s, W′*_s) uniformly and with replacement from {(W_t, W′_t) : t = 1, ..., n} for each s = 1, ..., n.
2) Compute \hat{B}^*_{M,n}, \hat{B}'^*_{M,n}, and \tau^*_{M,n} for the resampled data.
3) Set D^*_{n,i} = [(\hat{B}^*_{M,n} - \hat{B}'^*_{M,n}) - (\hat{B}_{M,n} - \hat{B}'_{M,n})] / \tau^*_{M,n}.
The first step preserves dependence between contemporary summands of the two scores. Block bootstrapping may again be employed if the summands are serially dependent, and confidence limits take the form

(\hat{B}_{M,n} - \hat{B}'_{M,n}) - \tau_{M,n} D^{*(r+1-k)}_{n} \quad \text{and} \quad (\hat{B}_{M,n} - \hat{B}'_{M,n}) - \tau_{M,n} D^{*(k)}_{n}.
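The paired resampling of steps 1–3 can be sketched as follows (Python, illustrative names); it takes the summands W_t and W′_t of the two systems, already computed for a common M and stored as NumPy arrays, and returns the equitailed limits.

```python
import numpy as np

def difference_bootstrap_ci(w1, w2, alpha=0.05, r=5000, rng=None):
    """Equitailed (1 - 2*alpha) bootstrap interval for B_M - B'_M from the
    paired summands of the two systems (serial independence assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(w1)

    def diff_and_scale(a, b):
        d = a - b
        diff = d.mean()
        scale = np.sqrt(np.sum((d - diff) ** 2) / (n * (n - 1)))
        return diff, scale

    diff_hat, tau_hat = diff_and_scale(w1, w2)
    d_star = np.empty(r)
    for i in range(r):
        idx = rng.integers(0, n, size=n)          # step 1: resample pairs jointly
        diff_s, tau_s = diff_and_scale(w1[idx], w2[idx])
        d_star[i] = (diff_s - diff_hat) / tau_s   # steps 2-3
    d_star.sort()
    k = int(np.floor(alpha * r))                  # assumes alpha * r >= 1
    return diff_hat - tau_hat * d_star[r - k], diff_hat - tau_hat * d_star[k - 1]
```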
Bootstrapped 90% confidence intervals for the difference between B_m for the ECMWF and MF forecasts are illustrated by the dark gray bands in Fig. 3 (bottom panel). The scores are significantly different at the 10% level between only the 0.3- and 0.4-quantile thresholds (50–70 mm).
[Fig. 5. Monte Carlo estimates of normal (solid) and bootstrap (dashed) interval lengths plotted against α when m = 8; n = 40; ρ = 0 (thin), 0.4 (medium), and 0.8 (thick); and p = (top) 0.9, (middle) 0.7, and (bottom) 0.5.]
The statistical significance of the differences between Brier scores can also be quantified using hypothesis tests. The powers of four such tests are investigated in appendix B, where the permutation test is found to be an attractive alternative to the bootstrap test presented above. The permutation test yields similar results for our data, however, with p values less than 0.1 between only the 0.3- and 0.4-quantile thresholds.
6. Multiple comparisons
We have so far constructed confidence intervals separately for each threshold u. These intervals are designed to contain the quantity of interest, such as an expected score or the difference between two expected scores, with a certain probability at each individual threshold. We may wish, however, to construct confidence intervals such that the quantity of interest is contained simultaneously within the intervals at all thresholds with a certain probability. We describe how to construct such confidence intervals in this section.
Denote by B(u) the quantity of interest at threshold u and suppose that we want to consider a collection S of thresholds. We aim to find confidence limits L(u) and U(u) for each u ∈ S such that

\Pr\{L(u) \leq B(u) \leq U(u) \text{ for all } u \in S\} = 1 - 2\alpha.     (10)

If we used the (1 − 2α) confidence limits proposed in previous sections for L(u) and U(u), then this probability would be less than 1 − 2α. For example, if scores were independent between thresholds, then the probability would be (1 − 2α)^{|S|}, where |S| is the number of thresholds.
Simultaneous confidence limits can be obtained using a bootstrap method described by Davison and Hinkley (1997, section 4.2.4). Suppose that equitailed confidence intervals at each threshold u are based on a bootstrap sample {T*_i(u) : i = 1, ..., r} and have the form

L(u) = \hat{B}(u) - \hat{\sigma}(u) T^{*(r+1-k)}(u),
U(u) = \hat{B}(u) - \hat{\sigma}(u) T^{*(k)}(u)

for some 1 ≤ k ≤ r/2. If we use these limits to form simultaneous intervals, then the bootstrap estimate of the coverage probability [(10)] is

\frac{1}{r} \sum_{i=1}^{r} I\left( T^{*(k)}(u) \leq T^*_i(u) \leq T^{*(r+1-k)}(u) \text{ for all } u \in S \right).

It is sufficient, therefore, to choose k such that this estimate is as close as possible to 1 − 2α.
The resampling must preserve dependence between thresholds: the statistics {T*_i(u) : u ∈ S} should be computed from the same data for each i. So, resampling schemes take the following form.

1) Resample (X*_s, Y*_{s,1}, ..., Y*_{s,m}) from {(X_t, Y_{t,1}, ..., Y_{t,m}) : t = 1, ..., n} for each s = 1, ..., n.
2) Compute T*_i(u) for all u ∈ S.

The size of the resample may also need to be larger for simultaneous intervals. If scores were independent across thresholds, the worst case, then the bootstrap estimate of the coverage would be approximately (1 − 2k/r)^{|S|}. If this is to equal 1 − 2α, then we require r = 2k/[1 − (1 − 2α)^{1/|S|}] ≈ k|S|/α for large |S|. If α = 0.05 and we want k ≥ 5, for example, then r ≥ 100|S|.
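Given the matrix of bootstrap statistics T*_i(u), one threshold per column and one resample per row, the choice of k can be automated as in the sketch below (illustrative Python; the early exit relies on the coverage estimate being non-increasing in k).

```python
import numpy as np

def simultaneous_k(t_star, alpha=0.05):
    """Order-statistic index k whose bootstrap estimate of the simultaneous
    coverage probability (Eq. 10) is closest to 1 - 2*alpha.

    t_star : (r, S) array of statistics T*_i(u), with the columns computed
             from the same resamples so that dependence between thresholds
             is preserved.
    """
    r, _ = t_star.shape
    t_sorted = np.sort(t_star, axis=0)
    best_k, best_gap = 1, np.inf
    for k in range(1, r // 2 + 1):
        low = t_sorted[k - 1]                   # T*_(k)(u) for each threshold
        high = t_sorted[r - k]                  # T*_(r+1-k)(u)
        inside = np.all((t_star >= low) & (t_star <= high), axis=1)
        gap = abs(inside.mean() - (1.0 - 2.0 * alpha))
        if gap < best_gap:
            best_k, best_gap = k, gap
        if inside.mean() < 1.0 - 2.0 * alpha:   # coverage only falls as k grows
            break
    return best_k
```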
The dark gray bands in Fig. 6 are bootstrapped, simultaneous 90% confidence intervals for B_m for the ECMWF and MF rainfall forecasts. Considering all thresholds together, then, we find that, at the 10% level, neither the ECMWF nor MF forecasts differ significantly from climatology. The evidence for a difference between the ECMWF and MF forecasts is marginal at the 10% level.
7. Discussion
This article identified the effect of ensemble size on the expected Brier score [(7)] and, given ensembles of size m, an unbiased estimator [(8)] for the expected Brier score that would be obtained for any other ensemble size.
[Fig. 6. (top) Brier scores B̂_{m,n} (solid) for the ECMWF forecasts, with bootstrapped simultaneous 90% confidence intervals for B_m (dark gray) and B_{m,n} (light gray) at each threshold. Expected Brier scores are also shown for random forecasts (dotted) and climatology (dashed). (middle) As in the top panel but for the MF forecasts. (bottom) The difference (solid) in B̂_{m,n} between the ECMWF and MF forecasts, with bootstrapped simultaneous 90% confidence intervals for the differences between B_m (dark gray) and B_{m,n} (light gray).]
We assumed that ensemble members were exchangeable, an acceptable assumption when the forecast lead time is long enough for systematic differences between members to be lost. We proposed standard errors and confidence intervals for the expected Brier scores and found that bootstrap intervals performed well in a simulation study. When comparing the Brier scores from two forecasting systems, we proposed comparing estimates of expected Brier scores that would be obtained were the ensemble sizes equal, and described confidence intervals for their difference. We showed that if the Brier scores for several event definitions are of interest, then it is possible to construct confidence intervals that simultaneously contain with a specified probability the expected scores for all events.

We applied our methods to two sets of rainfall forecasts. For forecasting low rainfall totals, MF forecasts had lower Brier scores than ECMWF forecasts, even after estimating the effect of increasing the ECMWF ensemble size to infinity. Standard errors and confidence intervals suggested that the scores were only marginally significantly different at the 10% level for a few thresholds, and neither set of forecasts performed better than forecasting climatology.
Müller et al. (2005) have aims similar to ours but for the more general quadratic ranked probability score (RPS). They note that the expected RPS for perfectly calibrated but random ensemble forecasts exceeds the RPS obtained by forecasting climatology, which is equivalent to a perfectly calibrated random ensemble forecast with infinite ensemble size. This is analogous to B_m exceeding B_∞. Instead of using climatology as the reference forecast in RPS skill scores, they therefore propose using a perfectly calibrated random ensemble forecast with an ensemble size equal to that of the forecasts being assessed. This is equivalent to our proposal of comparing B̂_{m,n} with B_m instead of B_∞.
Müller et al. (2005) also produce confidence bands representing the sampling variation in the RPS skill score for random forecasts that arises among different observation–forecast datasets. Comparing a forecast system’s skill score with these bands provides a guide to its statistical significance relative to a random forecast, but does not provide a formal statistical test because the sampling variation in the system’s skill score is ignored. Our confidence intervals differ substantially: they are confidence intervals for the expected score of the forecast system being employed and can, therefore, be used to make comparisons with the expected score of any reference forecast, not only random forecasts, and can also be used to compare the expected scores of two forecasting systems.
The methods presented in this article can be extended in several ways. We have defined events as exceedances of thresholds for simplicity, but the same methods could be applied for events defined by membership of general sets. We have also considered scalar observations and forecasts for simplicity, but multivariate data can be handled with the same methods; for example, events could be defined by membership of multidimensional sets. The methods presented here can also be extended to multiple-category Brier scores (Brier 1950) and to the RPS. Computer code for the procedures presented in this article and written in the statistical programming language R is available from the author.
Acknowledgments. Conversations with Professor I. T. Jolliffe and Drs. C. A. S. Coelho, D. B. Stephenson, and G. J. van Oldenborgh (who provided the data), plus comments from the referees, helped to motivate and improve this work.
APPENDIX A
Conditional Brier Scores
a. Unbiased estimators
We discussed the expected Brier score [(1)] in the main part of the paper, where the expectation was taken over repeated sampling of forecasts and observations. We investigate the conditional expected Brier score in this appendix, where the expectation is taken over repeated sampling of forecasts, but where the observations remain fixed. This quantity would be of interest if we wanted to know how a forecasting system would have performed on a particular set of target observations for different ensemble sizes. As before, we shall find the effects of ensemble size and construct unbiased estimators, standard errors, and confidence intervals.

The only source of variation in the conditional case is the generation of ensemble members: each observation X_t, and the corresponding model details Z_t, are fixed. Consequently, we no longer need to assume stationarity, and the conditional expected Brier score is

B_{m,n} = \frac{1}{n} \sum_t E[(\hat{Q}_t - I_t)^2 \mid X_t, Z_t]
        = \frac{1}{n} \sum_t [E(\hat{Q}_t^2 \mid Z_t) - 2E(\hat{Q}_t \mid Z_t) I_t + I_t]

since X_t determines I_t. To see the effects of ensemble size, we assume again that the forecasts Q̂_t are simple
proportions [(3)] and that ensemble members are exchangeable. Then, for each t,

E(\hat{Q}_t \mid Z_t) = Q_t \quad \text{and} \quad E(\hat{Q}_t^2 \mid Z_t) = \frac{Q_t}{m} + \frac{m-1}{m} R_t

for some Q_t and R_t independent of m, and

B_{m,n} = \frac{1}{n} \sum_t \left[ R_t - 2 I_t Q_t + I_t + \frac{1}{m} (Q_t - R_t) \right]
        = B_{\infty,n} + \frac{1}{mn} \sum_t (Q_t - R_t),

where B_{∞,n} is the conditional counterpart of B_∞. The adjusted Brier score [(8)] is again unbiased for B_{M,n}.
b. Standard errors
Estimating the uncertainty about B_{M,n} due to sampling variation is harder than for B_M because we no longer assume stationarity. The contribution to the sampling variation must therefore be quantifiable for each ensemble separately. This is easier if we strengthen our assumption of exchangeability to one of independent and identically distributed members. This assumption is difficult to test empirically for ensemble forecasts with a distribution that changes through time and requires further investigation (cf. Bergsma 2004), so we appeal to the long lead time of our rainfall forecasts for justification. In this case, the number K_t of members that forecast the event in the ensemble at time t has a binomial distribution with mean mQ_t and variance mQ_t(1 − Q_t). After some lengthy algebra, the conditional variance, σ²_{M,n}, of B̂_{M,n} can be shown to satisfy
\frac{Mmn\,\sigma^2_{M,n}}{M-1} = \frac{2(3-2m)(1 - 1/M)}{n(m-1)} \sum_t Q_t^4
  + \frac{4[\,m - 2 + (3-2m)/M\,]}{n(m-1)} \sum_t Q_t^3
  + \frac{2[\,1 + (2m-3)/(M-1) + (7-5m)/\{2M(M-1)\}\,]}{n(m-1)} \sum_t Q_t^2
  + \frac{1}{nM(M-1)} \sum_t Q_t
  + \frac{8}{n} \sum_t I_t Q_t^3 - \frac{12}{n} \sum_t I_t Q_t^2 + \frac{4}{n} \sum_t I_t Q_t.
This variance decreases as m^{-1} for large m, so B̂_{M,n} is consistent for B_{M,n} as m → ∞. An unbiased estimator for σ²_{M,n} can be constructed if m > 3 by replacing each Q_t^s in the previous equation with

\frac{K_t (K_t - 1) \cdots (K_t - s + 1)}{m (m - 1) \cdots (m - s + 1)}

for positive integers s. The square root, σ̃_{M,n}, of this unbiased estimator is then an estimator for the conditional standard error of B̂_{M,n}. If m ≤ 3, then we can replace Q_t^s with (K_t/m)^s instead, but note that if m = 1, then σ̃_{M,n} is always zero.
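The substitution is easy to apply in practice; the sketch below (illustrative Python, hypothetical function name) returns the unbiased estimate of Q_t^s from the exceedance count K_t whenever m ≥ s.

```python
import numpy as np

def q_power_unbiased(k_t, m, s):
    """Unbiased estimate of Q_t**s from K_t ~ Binomial(m, Q_t), valid for
    m >= s: K(K-1)...(K-s+1) / [m(m-1)...(m-s+1)]."""
    k_t = np.asarray(k_t, dtype=float)
    num = np.ones_like(k_t)
    den = 1.0
    for j in range(s):
        num *= k_t - j
        den *= m - j
    return num / den

# With m = 9 members and K_t = 6 exceedances, the unbiased estimate of Q_t**2
# is 6*5/(9*8) = 0.4167 rather than the plug-in (6/9)**2 = 0.4444.
print(q_power_unbiased([6], m=9, s=2))
```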
Estimates of these conditional standard errors are shown in Fig. 2 for the ECMWF and MF rainfall forecasts. As expected, they are smaller than their unconditional counterparts, which reflect the additional variation from sampling observations. In fact, the conditional standard errors are small enough to suggest that the superiority of the MF forecasts at thresholds below 90 mm is statistically significant and would remain for these particular observations even if different ensemble members were sampled.
c. Confidence intervals
A normal (1 − 2α) confidence interval for B_{M,n} is

\hat{B}_{M,n} \pm \tilde{\sigma}_{M,n} z_\alpha.

Bootstrap intervals approximate the distribution of

T_{M,n} = \frac{\hat{B}_{M,n} - B_{M,n}}{\tilde{\sigma}_{M,n}}

by a bootstrap sample {T*_{M,n,i} : i = 1, ..., r}. This sample can be formed by repeating the following steps for each i = 1, ..., r.

1) Resample Y*_{t,j} from {Y_{t,i} : i = 1, ..., m} for each j = 1, ..., m, and repeat for each t = 1, ..., n.
2) Form \hat{B}^*_{M,n} and \tilde{\sigma}^*_{M,n} from these resampled ensembles in the same way that the original ensembles were used to form \hat{B}_{M,n} and \tilde{\sigma}_{M,n}.
3) Set T^*_{M,n,i} = (\hat{B}^*_{M,n} - B^*_{M,n}) / \tilde{\sigma}^*_{M,n}, where

B^*_{M,n} = \frac{1}{n} \sum_t \left[ \hat{Q}_t^2 - 2 I_t \hat{Q}_t + I_t + \frac{1}{M} \hat{Q}_t (1 - \hat{Q}_t) \right].

Bootstrap (1 − 2α) confidence limits then take the form

\hat{B}_{M,n} - \tilde{\sigma}_{M,n} T^{*(r+1-k)}_{M,n} \quad \text{and} \quad \hat{B}_{M,n} - \tilde{\sigma}_{M,n} T^{*(k)}_{M,n}.
Bootstrapped 90% confidence intervals for B_{m,n} are illustrated for the ECMWF and MF forecasts in Fig. 3. Again, the intervals are narrower than those for B_m. The ECMWF forecasts are now significantly worse than climatology for many thresholds; that is, they are unlikely to do as well as climatology for these observations were new ensemble members to be sampled, but are significantly better than random forecasts except between the 30% and 50% quantiles (50–90 mm).
The MF forecasts are not significantly different from climatology at most thresholds, and are significantly better than random forecasts at all thresholds.
d. Simulation study
The simulation study of section 4 was repeated for B_{m,n}. Results are not shown but were qualitatively similar to those reported in section 4 for B_m except for rare events (p = 0.9). In that case, bootstrap intervals remain preferable to normal intervals except when ρ = 0, for which both intervals have large coverage errors.
e. Comparing Brier scores
Confidence intervals for the difference, B_{M,n} − B′_{M,n}, between the conditional expected Brier scores of two systems are easy to construct if the forecasts from the two systems at any time t can be considered independent once the model details Z_t and Z′_t are fixed. This assumption might be violated if the ensemble generation process causes pairing of members between the two systems, though any such dependence is likely to diminish with lead time. The distribution of
D_{M,n} = \frac{(\hat{B}_{M,n} - \hat{B}'_{M,n}) - (B_{M,n} - B'_{M,n})}{\tau_{M,n}},

where \tau^2_{M,n} = \tilde{\sigma}^2_{M,n} + \tilde{\sigma}'^2_{M,n}, can be approximated by a bootstrap sample of the quantity

D^*_{M,n} = \frac{(\hat{B}^*_{M,n} - \hat{B}'^*_{M,n}) - (B^*_{M,n} - B'^*_{M,n})}{\tau^*_{M,n}}

to obtain confidence limits

(\hat{B}_{M,n} - \hat{B}'_{M,n}) - \tau_{M,n} D^{*(r+1-k)}_{M,n} \quad \text{and} \quad (\hat{B}_{M,n} - \hat{B}'_{M,n}) - \tau_{M,n} D^{*(k)}_{M,n}.
Resampling follows the scheme described earlier in the section independently for each system.

Figure 3 (bottom panel) shows bootstrapped 90% confidence intervals for the difference between B_{m,n} for the ECMWF and MF forecasts. The MF score is significantly lower than the ECMWF score for most thresholds below 90 mm.
APPENDIX B
Hypothesis Tests
We used confidence intervals in section 5 to test null hypotheses of equal expected Brier scores. Using the normal confidence interval is equivalent to a z test [e.g., the test labeled S1 by Diebold and Mariano (1995)] and using the bootstrap interval is equivalent to a bootstrap test (e.g., Davison and Hinkley 1997, p. 171). Confidence intervals are useful for quantifying uncertainty even when no comparative test is attempted, but if comparison is the goal, then other test procedures might also be employed. Hypothesis tests such as the sign and signed-rank tests (Hamill 1999) test for differences between medians and are inappropriate for testing differences between Brier scores, which are sample means. Instead, we compare the powers of the z and bootstrap tests with those of a t test and a permutation test (Hamill 1999) in a simulation study.
The study design is similar to that in section 4, except that two sets of forecasts are simulated, one uncorrelated with the observations (ρ₁ = 0) while the correlation for the other set is varied from ρ₂ = 0 to ρ₂ = 1. The sets have the same expected Brier score when ρ₂ = 0 and the scores diverge as ρ₂ increases. For each value of ρ₂, 10 000 datasets are generated and subjected to the four tests at the 10% significance level. Figure B1 (left panel) shows Monte Carlo estimates of power for the four tests when m = 8, n = 20, and p = 0.5. All four tests have similar powers, although the z test is slightly oversized and the bootstrap test has slightly lower power far from the null hypothesis.
The z test in Diebold and Mariano (1995) is adapted to handle serial dependence, and block resampling can be used for the permutation and bootstrap tests. The power study is repeated with observations simulated from a first-order moving-average process with correlation 0.5 at lag one. Powers for these three tests are plotted in Fig. B1 (right panel) and show that the z test and, to a lesser extent, the bootstrap tests are oversized, while the permutation test has remained well sized and its power has reduced only slightly from the independent case.
[Fig. B1. Monte Carlo estimates of the powers of the bootstrap (solid), permutation (dashed), z (dotted), and t (dotted–dashed) tests against correlation ρ₂ (see text) for (left) serially independent and (right) dependent observations.]
From this limited study, the permutation test appears to be an attractive alternative to the bootstrap test for differences between Brier scores.
REFERENCES
Bergsma, W. P., 2004: Testing conditional independence for continuous random variables. EURANDOM Tech. Rep. 2004-048, 19 pp.
Bradley, A. A., T. Hashino, and S. S. Schwartz, 2003: Distributions-oriented verification of probability forecasts for small data samples. Wea. Forecasting, 18, 903–917.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.
Briggs, W., and D. Ruppert, 2005: Assessing the skill of yes/no predictions. Biometrics, 61, 799–807.
Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. Mon. Wea. Rev., 126, 2503–2518.
Chatfield, C., 2004: The Analysis of Time Series: An Introduction. Chapman and Hall, 333 pp.
Davison, A. C., and D. V. Hinkley, 1997: Bootstrap Methods and Their Application. Cambridge University Press, 592 pp.
DiCiccio, T. J., A. C. Monti, and G. A. Young, 2006: Variance stabilization for a scalar parameter. J. Roy. Stat. Soc., 68B, 281–303.
Diebold, F. X., and R. S. Mariano, 1995: Comparing predictive accuracy. J. Bus. Econ. Stat., 13, 253–263.
Hall, P., 1987: On the bootstrap and continuity correction. J. Roy. Stat. Soc., 49B, 82–89.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167.
Jolliffe, I. T., 2007: Uncertainty and inference for verification measures. Wea. Forecasting, 22, 637–650.
——, and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 240 pp.
Kane, T. L., and B. G. Brown, 2000: Confidence intervals for some verification measures–a survey of several methods. Preprints, 15th Conf. on Probability and Statistics, Asheville, NC, Amer. Meteor. Soc., 46–49.
Lahiri, S. N., 1993: Bootstrapping the Studentized sample mean of lattice variables. J. Mult. Anal., 45, 247–256.
Müller, W. A., C. Appenzeller, F. J. Doblas-Reyes, and M. A. Liniger, 2005: A debiased ranked probability skill score to evaluate probabilistic ensemble forecasts with small ensemble sizes. J. Climate, 18, 1513–1523.
Palmer, T. N., and Coauthors, 2004: Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER). Bull. Amer. Meteor. Soc., 85, 853–872.
Potts, J. M., 2003: Basic concepts. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., John Wiley and Sons, 13–36.
Richardson, D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Quart. J. Roy. Meteor. Soc., 127, 2473–2489.
Romano, J. P., 1988: A bootstrap revival of some nonparametric distance tests. J. Amer. Stat. Assoc., 83, 698–708.
Seaman, R., I. Mason, and F. Woodcock, 1996: Confidence intervals for some performance measures of yes-no forecasts. Aust. Meteor. Mag., 45, 49–53.
Severini, T. A., 2005: Elements of Distribution Theory. Cambridge University Press, 515 pp.
Stephenson, D. B., 2000: Use of the “odds ratio” for diagnosing forecast skill. Wea. Forecasting, 15, 221–232.
Thornes, J. E., and D. B. Stephenson, 2001: How to judge the quality and value of weather forecast products. Meteor. Appl., 8, 307–314.
Wilks, D. S., 1997: Resampling hypothesis tests for autocorrelated fields. J. Climate, 10, 65–82.
——, 2006: Statistical Methods in the Atmospheric Sciences. 2d ed. Academic Press, 627 pp.
Woodcock, F., 1976: The evaluation of yes/no forecasts for scientific and administrative purposes. Mon. Wea. Rev., 104, 1209–1214.