Probability Forecasting, Probability Evaluation, and Scoring Rules: Expanding the Toolbox

Robert L. Winkler, Duke University

Subjective Bayes Workshop – Warwick, December 2009
Transcript
Page 1

Probability Forecasting, Probability Evaluation, and Scoring Rules:

Expanding the Toolbox

Robert L. Winkler

Duke University

Subjective Bayes Workshop – Warwick

December 2009

Page 2

Outline of Presentation

• Probability Forecasting
• Why Probability Evaluation?
• Scoring Rules: Incentives and Evaluation
• Some Issues and Recent Developments
  • Extended Families of Scoring Rules
  • Relative Evaluation
  • Taking Order into Account
  • Probability Assessment vs. Quantile Assessment
  • Calibration, Sharpness, etc. – What Matters?
  • Competition among Forecasters
• Concluding Thoughts

Page 3

Probability Forecasting

• Formal representation of uncertainty

• Various sources of forecasts – all with some subjective element
  • Subjective forecasts from “experts”
  • Probabilities from Bayesian analyses
  • Probabilities from other modeling approaches

Page 4

Why Probability Evaluation?

• Recognition of the importance of uncertainty
  • Some increase in the use of probability forecasts
  • Concern about levels of risk

• Importance of ex post evaluation
  • Connection of forecasts with reality
  • Concern about the impact of “poor” forecasts
  • Improvement of future probability forecasts
  • Identification of better probability forecasters
  • Keeping forecasters honest (accountability!)

Page 5

Scoring Rules

• As incentives
  • Strictly proper scoring rules
  • Maximizing expected score ⇒ honest reporting, an incentive for “better” forecasts

• As evaluation measures
  • Overall measures of accuracy
  • Decompositions for specific characteristics

• Commonly-used rules (a short sketch follows below)

• Some background, historical notes
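
To make these concrete, here is a minimal Python sketch (mine, not from the talk) of the three commonly-used rules; r is the reported probability vector and j indexes the event that occurs:

```python
import math

def quadratic_score(r, j):
    # Quadratic (Brier-style) score: 2*r_j - sum_i r_i^2
    return 2 * r[j] - sum(p * p for p in r)

def spherical_score(r, j):
    # Probability of the realized event divided by the norm of the forecast
    return r[j] / math.sqrt(sum(p * p for p in r))

def logarithmic_score(r, j):
    # Log of the probability assigned to the realized event
    return math.log(r[j])

r = [0.8, 0.2]  # a two-event forecast; event 0 occurs
print(quadratic_score(r, 0))    # 0.92
print(spherical_score(r, 0))    # ~0.970
print(logarithmic_score(r, 0))  # ~-0.223
```

All three are strictly proper: reporting your true probabilities maximizes your expected score.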

Page 6

Commonly-Used Rules

[Figure: three panels showing the score and the expected score (n = 2) as functions of the reported probability for the quadratic, spherical, and logarithmic rules.]

Page 7

Some Issues & Recent Developments

• My own biased view!

• Based in part on recent work with Casey Lichtendahl, Victor Richmond Jose, Bob Nau, and others

• Influenced greatly by many years of work with Allan Murphy

Page 8

Extended Families of Scoring Rules

• Power and Pseudospherical families, each with a single parameter β (−∞ < β < ∞)

• Power family includes quadratic rule (β = 2)

• Pseudospherical family includes spherical rule (β = 2)

• Both families include logarithmic rule (β → 1)

• Provide rich families of strictly proper rules

Page 9

Power and Pseudospherical Families

\[
S^{\mathrm{Pow}}_{j}(\mathbf{r}) \;=\; \frac{1}{\beta-1}\left(\beta\, r_j^{\,\beta-1} \;-\; (\beta-1)\sum_{i=1}^{n} r_i^{\,\beta} \;-\; 1\right)
\]

\[
S^{\mathrm{Ps}}_{j}(\mathbf{r}) \;=\; \frac{1}{\beta-1}\left[\left(\frac{r_j}{\bigl(\sum_{i=1}^{n} r_i^{\,\beta}\bigr)^{1/\beta}}\right)^{\beta-1} - 1\right]
\]

Here r = (r_1, …, r_n) is the reported distribution, j is the event that occurs, and β ≠ 1; letting β → 1 gives the logarithmic score ln r_j in both families.
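
A quick numerical check of these families as a Python sketch (function names are mine): β = 2 reproduces the quadratic and spherical rules, shifted so a perfect forecast scores 0, and β near 1 approaches the logarithmic score.

```python
def power_score(r, j, beta):
    # Normalized power family; beta != 1
    s = sum(p ** beta for p in r)
    return (beta * r[j] ** (beta - 1) - (beta - 1) * s - 1) / (beta - 1)

def pseudospherical_score(r, j, beta):
    # Normalized pseudospherical family; beta != 1
    norm = sum(p ** beta for p in r) ** (1 / beta)
    return ((r[j] / norm) ** (beta - 1) - 1) / (beta - 1)

r, j = [0.8, 0.2], 0
print(power_score(r, j, 2))            # -0.08  (quadratic score 0.92, minus 1)
print(pseudospherical_score(r, j, 2))  # ~-0.030 (spherical score ~0.970, minus 1)
print(power_score(r, j, 1.001))        # ~-0.223, near log(0.8)
```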

Page 10

Relative Evaluation

• Standard scoring rules are symmetric
  • Maximum ES is smallest when the distribution is uniform – forecasts are rewarded for improvements over a uniform distribution
  • Is the uniform distribution a suitable baseline?

• Evaluation relative to a non-uniform baseline
  • Often makes more sense
  • Facilitates comparability of scores from different situations – improvements over their baselines
  • Motivated the “skill score” (not strictly proper – see the form below)
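
For reference, the skill score alluded to here is conventionally written relative to a baseline (my notation, not from the slides):

\[
\mathrm{SS} \;=\; \frac{\bar S_f - \bar S_b}{S^{*} - \bar S_b},
\]

where \(\bar S_f\) is the forecaster's mean score, \(\bar S_b\) the baseline's mean score, and \(S^{*}\) the score of a perfect forecast; SS > 0 indicates improvement over the baseline. The normalization depends on the realized sample, which is what typically breaks strict propriety.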

Page 11

Scoring Rules with Baseline Distributions

• Strictly proper asymmetric rules

• Generalized Power & Pseudospherical families
  • Strictly proper
  • Score = 0 when forecast = baseline
  • ES > 0 when forecast ≠ baseline

Page 12

Power and Pseudospherical Families with Baselines

\[
S^{\mathrm{Pow}}_{j}(\mathbf{r}\mid\mathbf{q}) \;=\; \frac{1}{\beta-1}\left(\beta \left(\frac{r_j}{q_j}\right)^{\beta-1} -\; (\beta-1)\sum_{i=1}^{n} q_i \left(\frac{r_i}{q_i}\right)^{\beta} -\; 1\right)
\]

\[
S^{\mathrm{Ps}}_{j}(\mathbf{r}\mid\mathbf{q}) \;=\; \frac{1}{\beta-1}\left[\left(\frac{r_j/q_j}{\bigl(\sum_{i=1}^{n} q_i (r_i/q_i)^{\beta}\bigr)^{1/\beta}}\right)^{\beta-1} - 1\right]
\]

Here q = (q_1, …, q_n) is the baseline distribution; both rules equal 0 when r = q, and a uniform q recovers the original families up to scaling.
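
A sketch of the baseline-weighted families in the same Python style (again my naming); the score is zero when the forecast coincides with the baseline:

```python
def power_score_baseline(r, q, j, beta):
    # Generalized power family with baseline q; beta != 1
    s = sum(qi * (ri / qi) ** beta for ri, qi in zip(r, q))
    return (beta * (r[j] / q[j]) ** (beta - 1) - (beta - 1) * s - 1) / (beta - 1)

def pseudospherical_score_baseline(r, q, j, beta):
    # Generalized pseudospherical family with baseline q; beta != 1
    norm = sum(qi * (ri / qi) ** beta for ri, qi in zip(r, q)) ** (1 / beta)
    return (((r[j] / q[j]) / norm) ** (beta - 1) - 1) / (beta - 1)

q = [0.2, 0.7, 0.1]
print(power_score_baseline(q, q, 0, 2))               # 0.0: forecast == baseline
print(power_score_baseline([0.5, 0.4, 0.1], q, 0, 2)) # positive: beat the baseline
```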

Page 13

Expected Scores (β = 2)

[Figure: expected-score plots over States 1–3 for four rules: the power score with q = (0.2, 0.7, 0.1), the quadratic score, the spherical score, and the pseudospherical score with q = (0.2, 0.7, 0.1).]

Page 14

Taking Order Into Account

• What if the events of interest are ordered?
  • Standard scoring rules ignore the order
  • For events other than the one that occurs, the probabilities matter but not “where they are”

• Do we want the scoring rule to reflect order?
  • Then we want it to be “sensitive to distance”
  • Having more probability on events “close” to the event that happens than on events “far” from it results in a higher score

Page 15

Scoring Rules that Reflect Order

• First such rule: the ranked probability score
  • Based on the quadratic scoring rule
  • Reflects order by using cumulative probabilities

• The same approach can be used with any basic scoring rule

• Can generate Power and Pseudospherical families that include baseline distributions and are sensitive to order

Page 16

Sensitive to Distance Rules

With cumulative probabilities R_i = r_1 + … + r_i (and cumulative baseline probabilities Q_i), and outcome j:

\[
\mathrm{RPS}_{j}(\mathbf{r}) \;=\; \sum_{i=1}^{j-1} R_i^{\,2} \;+\; \sum_{i=j}^{n-1} (1 - R_i)^{2}
\]

\[
S_{j}(\mathbf{r}) \;=\; \sum_{i=1}^{j-1} S_2(R_i,\, 1-R_i) \;+\; \sum_{i=j}^{n-1} S_1(R_i,\, 1-R_i)
\]

\[
S_{j}(\mathbf{r}\mid\mathbf{q}) \;=\; \sum_{i=1}^{j-1} S_2(R_i,\, 1-R_i \mid Q_i,\, 1-Q_i) \;+\; \sum_{i=j}^{n-1} S_1(R_i,\, 1-R_i \mid Q_i,\, 1-Q_i)
\]

Here S_k denotes the binary score when event k occurs, so any basic rule S, with or without a baseline, generates an order-sensitive rule; the RPS is the quadratic special case, written here in its usual penalty form.
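
The construction is easy to operationalize; a Python sketch (mine) that applies a binary quadratic rule to each cumulative split, so that probability mass close to the realized state scores better than mass far from it:

```python
def binary_quadratic(p_event, event_occurred):
    # Quadratic score for a two-event forecast (p_event, 1 - p_event)
    p = p_event if event_occurred else 1 - p_event
    return 2 * p - (p_event ** 2 + (1 - p_event) ** 2)

def ordered_score(r, j):
    # Apply the binary rule to each split {states <= i} vs {states > i}
    n = len(r)
    cum = [sum(r[: i + 1]) for i in range(n)]  # cumulative probabilities R_i
    return sum(binary_quadratic(cum[i], j <= i) for i in range(n - 1))

print(ordered_score([0.6, 0.3, 0.1], 0))  # 1.66: mass close to state 0
print(ordered_score([0.6, 0.1, 0.3], 0))  # 1.50: same r_j, mass farther away
```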

Page 17

Expected Scores (β = 2)

[Figure: expected-score plots over States 1–3 for the quadratic score, the ranked probability score, the power score with q = (1/3, 1/3, 1/3), and the power score with q = (0.7, 0.2, 0.1).]

Page 18

Probabilities vs. Quantiles

• Often we have ordered events or values of a variable, as noted earlier

• We might have quantile forecasts, or a mix of probability forecasts and quantile forecasts

• Why not just use the previously discussed scoring rules to evaluate quantiles?
  • They provide improper incentives for quantiles
  • They can be gamed to get perfect calibration artificially

Page 19

Scoring Rules for Quantiles

• Scores are based on the quantiles and on the actual value of the variable

• Rules based on linear loss functions

• Can be used for multiple quantiles
  • Special case: interval forecasts (2 quantiles)

• Strictly proper for quantile assessment
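
The slide leaves the exact formulas open, but two standard linear-loss examples fit this description; a Python sketch (my naming), with both scores negatively oriented (lower is better):

```python
def pinball_loss(quantile, x, alpha):
    # Linear ("pinball") loss: strictly proper for the alpha-quantile
    u = x - quantile
    return u * (alpha - (1.0 if u < 0 else 0.0))

def interval_score(lower, upper, x, alpha):
    # Interval score for a central (1 - alpha) interval: width plus a
    # linear penalty whenever x falls outside [lower, upper]
    penalty = 0.0
    if x < lower:
        penalty = (2 / alpha) * (lower - x)
    elif x > upper:
        penalty = (2 / alpha) * (x - upper)
    return (upper - lower) + penalty

print(pinball_loss(15, 23, 0.95))        # 7.6: x landed above the 0.95 quantile
print(interval_score(10, 20, 23, 0.10))  # 70: a 90% interval missed by 3
```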

Page 20

Calibration & Sharpness

• How much should we care about calibration?
  • Evaluation of probability forecasts often focuses almost exclusively on calibration
  • Forecasters can game calibration measures
  • A good Bayesian should try to recalibrate probabilities (viewing them as new information)
  • A key issue: will the forecasts be taken at face value?

Page 21

Calibration & Sharpness, cont.

• What about sharpness?
  • Sharpness is a better indicator of how informative the forecasts are (or can be)
  • Sharpness can often be improved with effort (more data-gathering, improved models)
  • Limited exploration suggests that sharpness has a greater impact on overall scores than calibration

• Ideal: maximize sharpness subject to good calibration

• Pay more attention to sharpness measures
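
One concrete way to weigh calibration against sharpness is the classic decomposition of the mean Brier score (Murphy 1973) into reliability (a calibration penalty), resolution (a sharpness-related reward), and uncertainty; a rough Python sketch (mine, with naive binning) for binary events:

```python
from collections import defaultdict

def brier_decomposition(probs, outcomes):
    # Mean Brier score = reliability - resolution + uncertainty
    n = len(probs)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)
    for p, x in zip(probs, outcomes):
        bins[round(p, 1)].append(x)  # group forecasts by stated probability
    reliability = resolution = 0.0
    for p, xs in bins.items():
        freq = sum(xs) / len(xs)  # observed frequency in this bin
        reliability += len(xs) / n * (p - freq) ** 2
        resolution += len(xs) / n * (freq - base_rate) ** 2
    return reliability, resolution, base_rate * (1 - base_rate)
```

A forecaster can drive reliability to zero with hedged, uninformative forecasts, but only resolution – sharpness put to use – pushes the mean score below the uncertainty term, which is the point the slide makes.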

Page 22

Competition Among Forecasters

• Forecasters can have goals other than maximizing ES
  • Utility could be nonlinear in the score
    • Risk averse
    • Step function with a step at a “target score”
  • Goal could be to outscore other forecasters
    • Brings in game-theoretic considerations
    • Results in forecasts that are more extreme than the forecaster’s probabilities
  • Goal could be to be similar to other forecasters, or not to be near the bottom of the group in scores
    • Leads to so-called “herding” behavior

• Tricky to model
  • We don’t know exactly what a forecaster’s goals are
  • Can’t recover the forecaster’s probabilities from those that are reported

Page 23

Concluding Thoughts/Recommendations

• Encourage greater use of probabilities
• Encourage consistent evaluation of probabilities
• Rich families of scoring rules
  • Need better understanding of the characteristics of different rules
  • Choice of a baseline distribution and sensitivity to distance are more important than the value of β
• Tailor rules to other probability-related forecasts, such as quantiles, for proper incentives
• Pay more attention to sharpness, less to calibration
• Be aware of possible other goals and their impact on reported probabilities
• Other issues in probability forecasting & evaluation