Quantile regression as a means of calibrating and verifying a mesoscale NWP ensemble Tom Hopson 1 Josh Hacker 1, Yubao Liu 1, Gregory Roux 1, Wanli Wu.

Quantile regression as a means of calibrating and verifying a mesoscale NWP ensemble

Tom Hopson¹, Josh Hacker¹, Yubao Liu¹, Gregory Roux¹, Wanli Wu¹, Jason Knievel¹, Tom Warner¹, Scott Swerdlin¹, John Pace², Scott Halvorson²

² U.S. Army Test and Evaluation Command

1

Outline

I. Motivation: ensemble forecasting and post-processing
II. E-RTFDDA for Dugway Proving Grounds
III. Introduce Quantile Regression (QR; Koenker and Bassett, 1978)
IV. Post-processing procedure
V. Verification results
VI. Warning: dynamically finding ensemble dispersion can put ensemble mean utility at risk
VII. Conclusions

Goals of an EPS

• Predict the observed distribution of events and atmospheric states
• Predict uncertainty in the day’s prediction
• Predict the extreme events that are possible on a particular day
• Provide a range of possible scenarios for a particular forecast

1. Greater accuracy of ensemble mean forecast (half the error variance of single forecast)
2. Likelihood of extremes
3. Non-Gaussian forecast PDFs
4. Ensemble spread as a representation of forecast uncertainty

=> All rely on forecasts being calibrated

Further … -- Argue calibration essential for tailoring to local application:

NWP provides spatially- and temporally-averaged gridded forecast output

-- Applying gridded forecasts to point locations requires location specific calibration to account for local spatial- and temporal-scales of variability ( => increasing ensemble dispersion)

More technically …

Dugway Proving Grounds, Utah e.g. T Thresholds

• Includes random and systematic differences between members.

• Not an actual chance of exceedance unless calibrated.
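As a minimal sketch of the point above, the raw exceedance "probability" for a temperature threshold is just the fraction of members above it. The member values and the function name `exceedance_fraction` are hypothetical, chosen only for illustration; the result is a member count, not a calibrated chance of exceedance.

```python
def exceedance_fraction(members, threshold):
    """Fraction of ensemble members strictly above the threshold."""
    if not members:
        raise ValueError("ensemble is empty")
    return sum(1 for m in members if m > threshold) / len(members)


# Hypothetical 10-member 2-m temperature ensemble [K]
ensemble_t = [271.2, 272.0, 272.4, 272.9, 273.3, 273.6, 274.1, 274.8, 275.0, 276.3]

# 6 of the 10 members exceed 273.15 K, so the raw fraction is 0.6.
# It only becomes a usable probability once the ensemble is calibrated.
raw_prob = exceedance_fraction(ensemble_t, 273.15)
```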

Challenges in probabilistic mesoscale prediction

• Model formulation
  • Bias (marginal and conditional)
  • Lack of variability caused by truncation and approximation
  • Non-universality of closure and forcing
• Initial conditions
  • Small-scales are damped in analysis systems, and the model must develop them
  • Perturbation methods designed for medium-range systems may not be appropriate
• Lateral boundary conditions
  • After short time periods the lateral boundary conditions can dominate
  • Representing uncertainty in lateral boundary conditions is critical
• Lower boundary conditions
  • Dominate boundary-layer response
  • Difficult to estimate uncertainty in lower boundary conditions

RTFDDA and Ensemble-RTFDDA

Liu et al. 2010 AMS Annual Meeting, 14th IOAS-AOLS, Atlanta, GA. January 18 – 23

The Ensemble Execution Module

[Diagram: perturbations and observations feed each ensemble member (Member 1, Member 2, Member 3, …, Member N); each member runs RTFDDA and produces 36-48h fcsts. Outputs: postprocessing, input to decision support tools, archiving and verification.]


Real-time Operational Products for DPG

Operated at US Army DPG since Sep. 2007

[Figure: model domains D1, D2, D3]

Surface and X-sections – Mean, Spread, Exceedance Probability, Spaghetti, …

Pin-point Surface and Profiles – Mean, Spread, Exceedance probability, Spaghetti, Wind roses, Histograms …

[Figure panels: Likelihood for SPD > 10 m/s; Mean T & Wind; T Mean and SD; Wind Speed; T-2m; Wind Rose]

Forecast “calibration” or “post-processing”

[Figure: forecast PDF and observation (obs) before and after calibration, illustrating “bias” and “spread” or “dispersion”; axes: Probability vs. Flow rate [m3/s]]

Post-processing has corrected:
• the “on average” bias
• as well as under-representation of the 2nd moment of the empirical forecast PDF (i.e. corrected its “dispersion” or “spread”)

Our approach:
• under-utilized “quantile regression” approach
• probability distribution function “means what it says”
• daily variation in the ensemble dispersion directly relates to changes in forecast skill => informative ensemble skill-spread relationship

Example of Quantile Regression (QR)

Our application

Fitting T quantiles using QR conditioned on:

1) Ranked forecast ens

2) ensemble mean

3) ensemble median

4) ensemble stdev

5) Persistence
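A minimal sketch of quantile regression on a single regressor may help here. It minimizes the QR cost (the "pinball" or check loss) by exhaustive search, relying on the property that for a line with intercept an optimal fit can be found among the lines through two data points. The function names and the data are hypothetical; the talk's actual fits use several regressors and would need a proper solver.

```python
from itertools import combinations


def pinball_loss(y, y_hat, tau):
    """Quantile-regression (check) loss for one observation."""
    err = y - y_hat
    return tau * err if err >= 0 else (tau - 1.0) * err


def fit_qr_line(x, y, tau):
    """Fit y ~ a + b*x for quantile tau by trying every two-point line.

    Only practical for small samples: the search is O(n^3).
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line, not a valid regression line
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        cost = sum(pinball_loss(yk, a + b * xk, tau) for xk, yk in zip(x, y))
        if best is None or cost < best[0]:
            best = (cost, a, b)
    if best is None:
        raise ValueError("need at least two distinct regressor values")
    return best[1], best[2]


# Hypothetical training pairs: ensemble-mean temperature vs. observed temperature [K]
ens_mean = [270.0, 271.5, 273.0, 274.0, 276.0, 277.5, 279.0, 280.0]
observed = [269.1, 272.4, 272.2, 275.9, 275.0, 279.3, 278.1, 282.0]

a10, b10 = fit_qr_line(ens_mean, observed, 0.10)  # lower quantile line
a90, b90 = fit_qr_line(ens_mean, observed, 0.90)  # upper quantile line
```

Fitting a separate line per quantile is what allows different regressors and coefficients at different parts of the forecast PDF.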

[Figure: time series of T [K]; legend: forecasts, observed]

Regressor set:
1. reforecast ens
2. ens mean
3. ens stdev
4. persistence
5. LR quantile (not shown)

Step 1: Determine climatological quantiles

[Figure: climatological PDF; axes: Probability/K vs. Temperature [K]]

Step 2: For each quantile, use “forward step-wise cross-validation” to iteratively select the best subset of regressors

Selection requirements:
a) QR cost function minimum
b) Satisfy binomial distribution at 95% confidence

If requirements not met, retain climatological “prior”
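One possible reading of requirement (b) is sketched below. If a fitted quantile tau is reliable, the number of verifying observations falling below it in n cases should follow a Binomial(n, tau) distribution, so a count outside the central 95% range rejects the model. The helper names and the two-sided test are assumptions for illustration, not necessarily the test used in the talk.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); returns 0.0 when k < 0."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def passes_binomial_check(n_below, n_total, tau, confidence=0.95):
    """True if n_below is consistent with Binomial(n_total, tau), two-sided."""
    alpha = 1.0 - confidence
    lower_tail = binomial_cdf(n_below, n_total, tau)            # P(X <= n_below)
    upper_tail = 1.0 - binomial_cdf(n_below - 1, n_total, tau)  # P(X >= n_below)
    return lower_tail > alpha / 2.0 and upper_tail > alpha / 2.0


# Hypothetical cross-validation result: 14 of 100 observations fell below
# the fitted 0.10 quantile.
keep_model = passes_binomial_check(14, 100, 0.10)
```

In this sketch a candidate that fails the check would be discarded and the climatological quantile kept in its place.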


Step 3: segregate forecasts into differing ranges of ensemble dispersion and refit models (Step 2) uniquely for each range
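The segregation in Step 3 can be sketched as follows, assuming equal-count bins of ensemble standard deviation. The bin count, the function name `dispersion_bins`, and the data are hypothetical choices for illustration.

```python
from statistics import stdev


def dispersion_bins(ensembles, n_bins=3):
    """Assign each forecast to an equal-count bin of ensemble spread.

    Returns (spreads, bins), where bins[i] is in 0 .. n_bins-1 and
    bin 0 holds the least dispersive forecasts.
    """
    spreads = [stdev(members) for members in ensembles]
    order = sorted(range(len(spreads)), key=lambda i: spreads[i])
    bins = [0] * len(spreads)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(spreads)
    return spreads, bins


# Hypothetical two-member forecasts with different spreads
forecasts = [[0.0, 6.0], [0.0, 1.0], [0.0, 4.0], [0.0, 2.0], [0.0, 5.0], [0.0, 3.0]]
spreads, bins = dispersion_bins(forecasts, n_bins=3)
# bins is [2, 0, 1, 0, 2, 1]; the Step 2 models would then be refit per bin
```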

[Figure: forecast time series of T [K] divided into dispersion ranges I, II, III, II, I; forecast PDF showing prior and posterior; axes: Probability/K vs. Temperature [K]]

Final result: “sharper” posterior PDF represented by interpolated quantiles
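To illustrate how a set of fitted quantiles can stand in for the full PDF, the sketch below builds a piecewise-linear CDF through the quantile values and reads off a probability at an arbitrary threshold. Linear interpolation and clamping outside the fitted range are assumptions here, and the quantile values are hypothetical.

```python
def cdf_from_quantiles(probs, values, x):
    """Piecewise-linear CDF through (value, prob) pairs.

    `values` must be sorted in non-decreasing order and aligned with `probs`.
    Outside the fitted range the CDF is clamped to the end probabilities.
    """
    if x <= values[0]:
        return probs[0]
    if x >= values[-1]:
        return probs[-1]
    for k in range(1, len(values)):
        if x <= values[k]:
            w = (x - values[k - 1]) / (values[k] - values[k - 1])
            return probs[k - 1] + w * (probs[k] - probs[k - 1])


# Hypothetical calibrated temperature quantiles [K]
probs = [0.10, 0.25, 0.50, 0.75, 0.90]
values = [270.0, 272.0, 273.0, 275.0, 278.0]

p_below = cdf_from_quantiles(probs, values, 274.0)  # halfway between 0.50 and 0.75
p_exceed = 1.0 - p_below
```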

Utilizing Verification measures near-real-time …

Measures Used:
1) Rank histogram (converted to scalar measure)
2) Root mean square error (RMSE)
3) Brier score
4) Ranked Probability Score (RPS)
5) Relative Operating Characteristic (ROC) curve
6) New measure of ensemble skill-spread utility

=> Using these for automated calibration model selection by using weighted sum of skill scores of each
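Two of the measures listed above can be sketched in a few lines. The rank histogram and Brier score follow their standard definitions. The scalar `histogram_flatness` is only one hypothetical way of collapsing a rank histogram to a single number, since the talk does not specify its conversion.

```python
def rank_histogram(ensembles, observations):
    """Count how often the observation falls at each rank among the members."""
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for members, obs in zip(ensembles, observations):
        rank = sum(1 for m in members if m < obs)
        counts[rank] += 1
    return counts


def histogram_flatness(counts):
    """Mean squared deviation from a uniform histogram (0 means perfectly flat)."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 for c in counts) / len(counts)


def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


# Hypothetical 3-member ensemble, identical on four days
ens = [[1.0, 2.0, 3.0]] * 4
flat = rank_histogram(ens, [0.5, 1.5, 2.5, 3.5])      # [1, 1, 1, 1]
u_shaped = rank_histogram(ens, [0.0, 0.0, 4.0, 4.0])  # [2, 0, 0, 2], under-dispersive
```

A U-shaped histogram, with observations piling up outside the ensemble range, is the signature of an under-dispersive ensemble.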

Problems with Spread-Skill Correlation …

• ECMWF spread-skill (black) correlation << 1
• Even “perfect model” (blue) correlation << 1 and varies with forecast lead-time

[Figure panels by lead-time: 1 day, 4 day, 7 day, 10 day. Annotated correlations: ECMWF r = 0.33, “Perfect” r = 0.68; ECMWF r = , “Perfect” r = 0.56]

National Security Applications Program Research Applications Laboratory

3-hr dewpoint time series, Station DPG S01
[Figure panels: Before Calibration; After Calibration]

42-hr dewpoint time series, Station DPG S01
[Figure panels: Before Calibration; After Calibration]

PDFs: raw vs. calibrated

Blue is “raw” ensemble
Black is calibrated ensemble
Red is the observed value

Notice: significant change in both “bias” and dispersion of final PDF (also notice PDF asymmetries)


3-hr dewpoint rank histograms, Station DPG S01

42-hr dewpoint rank histograms, Station DPG S01

Skill Scores

• Single value to summarize performance.
• Reference forecast - best naive guess; persistence, climatology
• A perfect forecast implies that the object can be perfectly observed
• Positively oriented – Positive is good

$$SS = \frac{A_{forc} - A_{ref}}{A_{perf} - A_{ref}}$$
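As a small worked example of the skill score definition, the sketch below applies it to an error measure such as RMSE, for which the perfect value is 0. The function name and the numbers are hypothetical.

```python
def skill_score(a_forecast, a_reference, a_perfect=0.0):
    """Generic skill score: 1 is perfect, 0 matches the reference, < 0 is worse."""
    if a_perfect == a_reference:
        raise ValueError("reference forecast is already perfect; score undefined")
    return (a_forecast - a_reference) / (a_perfect - a_reference)


# Hypothetical RMSE values [K]: calibrated forecast 1.5, persistence reference 3.0
rmse_ss = skill_score(1.5, 3.0)  # (1.5 - 3.0) / (0.0 - 3.0) = 0.5
```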


Skill Score Verification

[Figure panels: RMSE Skill Score; CRPS Skill Score]

Reference Forecasts:
Black -- raw ensemble
Blue -- persistence

Computational Resource Questions:

How best to utilize multi-model simulations (forecasts), especially if under-dispersive?

a) Should more dynamical variability be searched for? Or
b) Is it better to balance post-processing with multi-model utilization to create a properly dispersive, informative ensemble?


3-hr dewpoint rank histograms, Station DPG S01


RMSE of ensemble members

3hr Lead-time 42hr Lead-time

Station DPG S01


Significant calibration regressors

3hr Lead-time 42hr Lead-time

Station DPG S01

Questions revisited:

How best to utilize multi-model simulations (forecasts), especially if under-dispersive?

a) Should more dynamical variability be searched for? Or
b) Is it better to balance post-processing with multi-model utilization to create a properly dispersive, informative ensemble?

Warning: adding more models can lead to decreasing utility of the ensemble mean (even if the ensemble is under-dispersive)

Summary

Quantile regression provides a powerful framework for improving the whole (potentially non-Gaussian) PDF of an ensemble forecast – different regressors for different quantiles and lead-times

This framework provides an umbrella to blend together multiple statistical correction approaches (logistic regression, etc., not shown) as well as multiple regressors

As well, “step-wise cross-validation” based calibration provides a method to ensure forecast skill no worse than climatology and persistence for a variety of cost functions

As shown here, significant improvements were made to the forecast’s ability to represent its own potential forecast error (while improving sharpness):

– uniform rank histogram
– significant spread-skill relationship (new skill-spread measure)

Care should be used before “throwing more models” at an “under-dispersive” forecast problem

Further questions: [email protected] or [email protected]


Dugway Proving Ground

Other options …

Assign dispersion bins, then:

2) Average the error values in each bin, then correlate

3) Calculate individual rank histograms for each bin, convert to a scalar measure
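The binned approach in option 2) might look like the sketch below: sort forecasts by ensemble spread, split them into equal-count bins, average spread and absolute error within each bin, then correlate the bin averages. The function names, the use of absolute error, and the data are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def binned_spread_skill(spreads, abs_errors, n_bins):
    """Average spread and error within equal-count spread bins, then correlate."""
    order = sorted(range(len(spreads)), key=lambda i: spreads[i])
    size = len(order) // n_bins
    bin_spread, bin_error = [], []
    for b in range(n_bins):
        # the last bin absorbs any remainder
        idx = order[b * size:] if b == n_bins - 1 else order[b * size:(b + 1) * size]
        bin_spread.append(sum(spreads[i] for i in idx) / len(idx))
        bin_error.append(sum(abs_errors[i] for i in idx) / len(idx))
    return bin_spread, bin_error, pearson(bin_spread, bin_error)


# Hypothetical ensemble spreads and absolute errors of the ensemble mean
spread = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
error = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
bin_s, bin_e, r_binned = binned_spread_skill(spread, error, n_bins=3)
```

Averaging within bins removes much of the case-to-case sampling noise that keeps the raw spread-skill correlation well below 1 even for a perfect model.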

Example: French Broad River
Before Calibration => underdispersive

Black curve shows observations; colors are ensemble

Rank Histogram Comparisons

[Figure panels: Raw full ensemble; After calibration]

After quantile regression, rank histogram more uniform (although now slightly over-dispersive)

Frequency Used for Quantile Fitting of Method I:
Best Model = 76%
Ensemble StDev = 13%
Ensemble Mean = 0%
Ranked Ensemble = 6%

What Nash-Sutcliffe (RMSE) implies about Utility

Note:

Take home message:

For a “calibrated ensemble”, error variance of the ensemble mean is 1/2 the error variance of any ensemble member (on average), independent of the distribution being sampled

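[Figure: forecast PDF and observation (obs); axes: Probability vs. Discharge]

With $\langle \cdot \rangle_i$ denoting the ensemble average, compare

$$\left\langle (f_i - o)^2 \right\rangle_i \quad \text{versus} \quad (\bar{f} - o)^2$$

Simplifying:

$$\text{eq1}: \; \langle f_i^2 \rangle_i - 2 o \bar{f} + o^2$$

$$\text{eq2}: \; \bar{f}^2 - 2 o \bar{f} + o^2$$

For a calibrated ensemble the observation behaves like another member, $o : f_j \Rightarrow \langle \cdot \rangle_j$, so that $\langle o \rangle_j = \bar{f}$ and $\langle o^2 \rangle_j = \langle f^2 \rangle$:

$$\text{eq1}: \; 2 \left( \langle f^2 \rangle - \bar{f}^2 \right)$$

$$\text{eq2}: \; \langle f^2 \rangle - \bar{f}^2$$

$$\Rightarrow \text{eq1} = 2 \, \text{eq2}$$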
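The factor of two can be checked numerically with a small Monte Carlo sketch. Here both members and observation are drawn independently from the same (deliberately non-Gaussian) exponential distribution, which is one simple way to mimic a calibrated ensemble; the sample sizes, seed, and function name are arbitrary choices. With a finite ensemble of N independent members the expected ratio is 2N/(N+1), so it approaches 2 as N grows.

```python
import random


def error_variance_ratio(n_members=50, n_cases=5000, seed=0):
    """Ratio of mean squared member error to mean squared ensemble-mean error."""
    rng = random.Random(seed)
    member_sq_err = 0.0
    mean_sq_err = 0.0
    for _ in range(n_cases):
        members = [rng.expovariate(1.0) for _ in range(n_members)]
        obs = rng.expovariate(1.0)  # observation from the same distribution
        ens_mean = sum(members) / n_members
        member_sq_err += sum((m - obs) ** 2 for m in members) / n_members
        mean_sq_err += (ens_mean - obs) ** 2
    return member_sq_err / mean_sq_err


ratio = error_variance_ratio()  # expected to be close to 2 * 50 / 51
```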

Sequentially-averaged models (ranked based on NS Score) and their resultant NS Score

=> Notice the degradation of NS with increasing number of models (with a peak at 2 models)

=> For an equitable multi-model, NS should rise monotonically

=> Maybe a smaller subset of models would have more utility? (A contradiction for an under-dispersive ensemble?)
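The sequential-averaging diagnostic can be sketched as follows, using the standard Nash-Sutcliffe efficiency (1 minus the ratio of squared forecast error to the variance of the observations). Models are ranked by their individual NS score and then averaged one at a time. The function names and the toy data are hypothetical.

```python
def nash_sutcliffe(forecast, observed):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the observed mean."""
    mean_obs = sum(observed) / len(observed)
    num = sum((f - o) ** 2 for f, o in zip(forecast, observed))
    den = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - num / den


def sequential_average_scores(models, observed):
    """NS score of the mean of the top-k models, for k = 1 .. number of models."""
    ranked = sorted(models, key=lambda m: nash_sutcliffe(m, observed), reverse=True)
    scores = []
    for k in range(1, len(ranked) + 1):
        avg = [sum(vals) / k for vals in zip(*ranked[:k])]
        scores.append(nash_sutcliffe(avg, observed))
    return scores


# Hypothetical toy case: one perfect model and two with opposite biases
obs_series = [1.0, 2.0, 3.0, 4.0]
model_set = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.0, 3.0]]
ns_by_size = sequential_average_scores(model_set, obs_series)
```

In this toy case the score drops when the second model is added and recovers only once the opposite bias cancels it, showing that adding models does not by itself guarantee a monotonic rise.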

What Nash-Sutcliffe (RMSE) implies about Utility (cont)

-- degradation with increased ensemble size

Initial Frequency Used for Quantile Fitting:

What Nash-Sutcliffe implies about Utility (cont)

Reduced Set Frequency Used for Quantile Fitting:

… using only top 1/3 of models to rank and form ensemble mean …
… earlier results …

=> Appears to be significant gains in the utility of the ensemble after “filtering” (except for drop in StDev) … however “proof is in the pudding” …
=> Examine verification skill measures …

Skill Score Comparisons between full- and “filtered” ensemble sets

GREEN -- full calibrated multi-model
BLUE -- “filtered” calibrated multi-model
Reference – uncalibrated set

Points:

-- quite similar results for a variety of skill scores
-- both approaches give appreciable benefit over the original raw multi-model output
-- however, only in the CRPSS is there improvement of the “filtered” ensemble set over the full set

=> post-processing method fairly robust
=> More work (more filtering?)!
