Young Statisticians Conference 7 February 2013

Page 4: Ysc2013

Outline

1 Where fools fear to tread

2 Working with inadequate tools

3 When you can’t lose

4 Getting dirty with data

5 Going to extremes

6 Final thoughts

Man vs Wild Data Where fools fear to tread 2

Page 5: Ysc2013

My story

Olympic video poker slots

Beware of smelly clients

Threats and slander

Nerves in court

Three university consulting services

Reviewing my own work

Six times an expert witness

Hundreds of clients

Page 13: Ysc2013

2 Working with inadequate tools

Page 17: Ysc2013

Disposable tableware company

Problem: Want forecasts of each of hundreds of items. Series can be stationary, trended or seasonal. They currently have a large forecasting program written in-house but it doesn’t seem to produce sensible forecasts. They want me to tell them what is wrong and fix it.

Additional information

Program written in COBOL, making numerical calculations limited. It is not possible to do any optimisation.

Their programmer has little experience in numerical computing.

They employ no statisticians and want the program to produce forecasts automatically.

Page 18: Ysc2013

Disposable tableware company

Methods currently used

A 12 month average

C 6 month average

E straight line regression over last 12 months

G straight line regression over last 6 months

H average slope between last year’s and this year’s values. (Equivalent to differencing at lag 12 and taking mean.)

I Same as H except over 6 months.

K I couldn’t understand the explanation.
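The equivalence noted for method H can be sketched as follows. This is only an illustration, assuming monthly data held in a plain Python list; the function name is hypothetical, and the slides do not say how the in-house program turned this quantity into a forecast.

```python
def mean_seasonal_difference(y, lag=12):
    """Mean of y[t] - y[t - lag] over the most recent `lag` observations.

    With lag=12 and monthly data this is the average change between
    last year's and this year's values.
    """
    if len(y) < 2 * lag:
        raise ValueError("need at least two full seasons of data")
    this_year = y[-lag:]
    last_year = y[-2 * lag:-lag]
    return sum(a - b for a, b in zip(this_year, last_year)) / lag
```

The same value is obtained by subtracting last year's mean from this year's mean, which is why the method amounts to differencing at lag 12 and averaging.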

Page 23: Ysc2013

Disposable tableware company

My solution

Use first differencing to deal with trend, or seasonal differencing to deal with seasonality.

Use simple exponential smoothing (SES) on (differenced) data with the parameter selected from {0.1, 0.3, 0.5, 0.7, 0.9}.

For each series, try 15 models: three differencing options (none, first, seasonal), each combined with SES at the 5 parameter values.

Model selected based on smallest MSE. (Only one parameter for each model, so no need to penalize for model size.)

Some lessons

Be pragmatic.

Understand your tools well enough to be able to adapt them.

A successful consulting job often uses very simple methods.
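The 15-model search described above needs no optimiser, only loops and arithmetic. The sketch below is a minimal Python illustration, not the COBOL implementation. The names, the seasonal lag of 12 (monthly data), the initial level (first observation) and the use of one-step-ahead errors are all assumptions.

```python
ALPHAS = (0.1, 0.3, 0.5, 0.7, 0.9)
# Lag 0 means no differencing; lag 12 assumes monthly data.
DIFFERENCING = {"none": 0, "first": 1, "seasonal": 12}


def difference(y, lag):
    """Return y unchanged for lag 0, otherwise the series y[t] - y[t - lag]."""
    if lag == 0:
        return list(y)
    return [y[t] - y[t - lag] for t in range(lag, len(y))]


def ses_one_step_mse(z, alpha):
    """MSE of one-step-ahead SES forecasts, with the level started at z[0]."""
    level = z[0]
    squared_errors = []
    for obs in z[1:]:
        squared_errors.append((obs - level) ** 2)
        level = alpha * obs + (1 - alpha) * level
    return sum(squared_errors) / len(squared_errors)


def select_model(y):
    """Try all 3 x 5 = 15 combinations; return (mse, differencing, alpha)."""
    candidates = []
    for name, lag in DIFFERENCING.items():
        z = difference(y, lag)
        for alpha in ALPHAS:
            candidates.append((ses_one_step_mse(z, alpha), name, alpha))
    return min(candidates)
```

A one-step error on the differenced series equals the one-step error on the original scale, so the MSE values are comparable across the three differencing options. They are, however, computed over slightly different numbers of observations.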

Page 24: Ysc2013

3 When you can’t lose

Page 29: Ysc2013

Forecasting the PBS

The Pharmaceutical Benefits Scheme (PBS) is the Australian government drugs subsidy scheme.

Many drugs bought from pharmacies are subsidised to allow more equitable access to modern drugs.

The cost to government is determined by the number and types of drugs purchased. Currently nearly 1% of GDP.

The total cost is budgeted based on forecasts of drug usage.

Page 35: Ysc2013

Forecasting the PBS

In 2001: $4.5 billion budget, under-forecasted by $800 million.

Thousands of products. Seasonal demand.

Subject to covert marketing, volatile products, uncontrollable expenditure.

Although monthly data available for 10 years, data are aggregated to annual values, and only the first three years are used in estimating the forecasts.

All forecasts being done with the FORECAST function in MS-Excel!
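For readers unfamiliar with it, Excel's FORECAST(x, known_y's, known_x's) fits a least-squares straight line and evaluates it at x. The sketch below reproduces that calculation in Python; the function name is hypothetical and nothing here comes from the PBS spreadsheets themselves.

```python
def excel_style_forecast(x, known_ys, known_xs):
    """Least-squares straight-line prediction at x (what Excel's FORECAST computes)."""
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxy = sum((xi - mean_x) * (yi - mean_y)
              for xi, yi in zip(known_xs, known_ys))
    sxx = sum((xi - mean_x) ** 2 for xi in known_xs)
    slope = sxy / sxx
    return mean_y + slope * (x - mean_x)
```

Applied to three annual totals, this is a straight line through three points, so it cannot represent seasonality or a changing trend.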

Page 36: Ysc2013

ATC drug classification

A Alimentary tract and metabolism
B Blood and blood forming organs
C Cardiovascular system
D Dermatologicals
G Genito-urinary system and sex hormones
H Systemic hormonal preparations, excluding sex hormones and insulins
J Anti-infectives for systemic use
L Antineoplastic and immunomodulating agents
M Musculo-skeletal system
N Nervous system
P Antiparasitic products, insecticides and repellents
R Respiratory system
S Sensory organs
V Various

Page 37: Ysc2013

ATC drug classification

A Alimentary tract and metabolism (14 classes at this level)

A10 Drugs used in diabetes (84 classes at this level)

A10B Blood glucose lowering drugs

A10BA Biguanides

A10BA02 Metformin
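The nesting in the metformin example follows from the code itself: each level is a prefix of the full code, of length 1, 3, 4, 5 and 7 characters. A small sketch, with a hypothetical function name, of how a full code maps to its parent groups:

```python
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_levels(code):
    """Split a full 7-character ATC code into its five nested levels."""
    if len(code) != 7:
        raise ValueError("expected a 7-character ATC code such as 'A10BA02'")
    return [code[:n] for n in ATC_LEVEL_LENGTHS]
```

For example, `atc_levels("A10BA02")` gives the five codes shown on the slide. Prefixes of this kind are one way to group product-level series into drug groups before forecasting.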

Page 42: Ysc2013

Forecasting the PBS

Monthly data on thousands of drug groups and 4 concession types available from 1991.

Method needs to be automated and implemented within MS-Excel.

Exponential smoothing seems appropriate (monthly data with changing trends and seasonal patterns), but in 2001, automated exponential smoothing was not well-developed, and not available in MS-Excel.

As part of this project, we developed an automatic forecasting algorithm for exponential smoothing state space models based on the AIC.

Forecast MAPE reduced from 15–20% to about 0.6%.
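The idea of choosing among exponential smoothing models by AIC can be illustrated with a much-reduced sketch. This is not the algorithm developed in the project. It compares only two candidates (simple exponential smoothing and Holt's linear trend), replaces likelihood maximisation with a coarse parameter grid, and uses the Gaussian AIC up to an additive constant. All names, the grid and the parameter counts are assumptions.

```python
import math

GRID = (0.1, 0.3, 0.5, 0.7, 0.9)  # coarse grid standing in for an optimiser


def ses_sse(y, alpha):
    """Sum of squared one-step errors for simple exponential smoothing."""
    level = y[0]
    sse = 0.0
    for obs in y[1:]:
        err = obs - level
        sse += err * err
        level += alpha * err
    return sse


def holt_sse(y, alpha, beta):
    """Sum of squared one-step errors for Holt's linear trend method."""
    level, trend = y[0], y[1] - y[0]
    sse = 0.0
    for obs in y[1:]:
        err = obs - (level + trend)
        sse += err * err
        level = level + trend + alpha * err
        trend = trend + alpha * beta * err
    return sse


def gaussian_aic(sse, n, k):
    """AIC up to an additive constant for Gaussian errors: n*log(SSE/n) + 2k."""
    return n * math.log(max(sse, 1e-12) / n) + 2 * k


def select_by_aic(y):
    """Return the name of the candidate with the smaller AIC, plus both AICs."""
    n = len(y) - 1  # number of one-step errors
    best_ses = min(ses_sse(y, a) for a in GRID)
    best_holt = min(holt_sse(y, a, b) for a in GRID for b in GRID)
    aic = {
        "ses": gaussian_aic(best_ses, n, k=2),    # alpha, initial level
        "holt": gaussian_aic(best_holt, n, k=4),  # plus beta, initial trend
    }
    return min(aic, key=aic.get), aic
```

The penalty term 2k is what lets models of different size be compared, unlike the tableware example where every candidate had one parameter.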

Page 43: Ysc2013

Forecasting the PBS

[Figure: Total cost: A03 concession safety net group. Vertical axis: $ thousands. Horizontal axis: 1995 to 2010.]

Page 44: Ysc2013

Forecasting the PBS

[Figure: Total cost: A05 general copayments group. Vertical axis: $ thousands. Horizontal axis: 1995 to 2010.]

Page 45: Ysc2013

Forecasting the PBS

[Figure: Total cost: D01 general copayments group. Vertical axis: $ thousands. Horizontal axis: 1995 to 2010.]

Page 46: Ysc2013

Forecasting the PBS

[Figure: Total cost: S01 general copayments group. Vertical axis: $ thousands. Horizontal axis: 1995 to 2010.]

Page 48: Ysc2013

Forecasting the PBS

[Figure: Total cost: R03 general copayments group. Vertical axis: $ thousands. Horizontal axis: 1995 to 2010.]

Some lessons

Often what people do is very bad, and it is easy to make a big difference.

Sometimes you have to invent new methods, and that can lead to publications.

You have to implement solutions in the client’s software environment.

Be aware of the politics.

Page 49: Ysc2013

4 Getting dirty with data

Page 52: Ysc2013

Airline passenger traffic

[Figure: three panels, each plotted against Year (1988 to 1993): First class passengers: Melbourne−Sydney; Business class passengers: Melbourne−Sydney; Economy class passengers: Melbourne−Sydney.]

Not the real data! Or is it?

Page 55: Ysc2013

Airline passenger traffic

[Figure: Economy Class Passengers: Melbourne−Sydney. Vertical axis: Passengers (thousands). Horizontal axis: 1988 to 1993.]

Page 57: Ysc2013

Possible model

$$Y_t = Y_t^* + Z_t$$

$$Y_t^* = \beta_0 + \sum_j \beta_j x_{t,j} + N_t$$

$Y_t$ = observed data for one passenger class.
$Y_t^*$ = reconstructed data.
$Z_t$ = latent process (usually equal to zero).
$x_{t,j}$ are covariates and dummy variables.
$N_t$ = seasonal ARIMA process of period 52.

Some lessons

Real data is often very messy. Be aware of the causes.

Get an answer even if it isn’t pretty.

What to do with the non-integer seasonality? (average 52.19)

How to deal with the correlations between classes and between routes?

You often think of better approaches long after the project is finished.

Page 58: Ysc2013

5 Going to extremes

Page 59: Ysc2013

Extreme electricity demand

Page 64: Ysc2013

The problem

We want to forecast the peak electricity demand in a half-hour period in ten years’ time.

We have twelve years of half-hourly electricity data, temperature data and some economic and demographic data.

The location is South Australia: home to the most volatile electricity demand in the world.

Sounds impossible?

Page 66: Ysc2013

South Australian demand data

[Figure: demand data, with an arrow marking Black Saturday.]

Page 67: Ysc2013

South Australian demand data

[Figure: South Australia state wide demand (summer 10/11). Vertical axis: South Australia state wide demand (GW). Horizontal axis: Oct 10 to Mar 11.]

Page 68: Ysc2013

South Australian demand data

[Figure: South Australia state wide demand (January 2011). Vertical axis: South Australian demand (GW). Horizontal axis: Date in January.]

Page 69: Ysc2013

Demand boxplots (Sth Aust)

Page 70: Ysc2013

Temperature data (Sth Aust)

Page 75: Ysc2013

Monash Electricity Forecasting Model

$$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$

$y_t$ denotes per capita demand at time $t$ (measured in half-hourly intervals) and $p$ denotes the time of day, $p = 1, \dots, 48$;

$h_p(t)$ models all calendar effects;

$f_p(w_{1,t}, w_{2,t})$ models all temperature effects, where $w_{1,t}$ is a vector of recent temperatures at location 1 and $w_{2,t}$ is a vector of recent temperatures at location 2;

$z_{j,t}$ is a demographic or economic variable at time $t$;

$n_t$ denotes the model error at time $t$.
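Because the calendar and temperature functions are indexed by the time of day p, the data need to be organised by half-hour period before anything is estimated. The sketch below shows one way to do that. The function names are hypothetical, and the convention that p = 1 is the half hour starting at midnight is an assumption, since the slides do not state it.

```python
from collections import defaultdict
from datetime import datetime


def half_hour_period(ts):
    """Map a timestamp to p in 1..48 (assumed: p = 1 starts at midnight)."""
    return ts.hour * 2 + ts.minute // 30 + 1


def split_by_period(timestamps, values):
    """Group observations by half-hour period, one list per value of p."""
    groups = defaultdict(list)
    for ts, value in zip(timestamps, values):
        groups[half_hour_period(ts)].append(value)
    return dict(groups)
```

Under this convention, 3:00 pm corresponds to p = 31.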

Page 81: Ysc2013

Monash Electricity Forecasting Model

$$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$

$h_p(t)$ handles annual, weekly and daily seasonal patterns as well as public holidays:

$$h_p(t) = \ell_p(t) + \alpha_{t,p} + \beta_{t,p} + \gamma_{t,p} + \delta_{t,p}$$

$\ell_p(t)$ is “time of summer” effect (a regression spline);

$\alpha_{t,p}$ is day of week effect;

$\beta_{t,p}$ is “holiday” effect;

$\gamma_{t,p}$ is New Year’s Eve effect;

$\delta_{t,p}$ is millennium effect.
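The day-of-week and holiday effects are categorical, so each date has to be labelled before the model is fitted. A minimal sketch follows, assuming the four holiday levels used in the fitted-results plot (normal, day before, holiday, day after). The function names and the set-of-dates representation of holidays are illustrative assumptions.

```python
from datetime import date, timedelta

DAY_NAMES = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def holiday_status(day, holidays):
    """Classify a date relative to a set of public holiday dates."""
    if day in holidays:
        return "holiday"
    if day + timedelta(days=1) in holidays:
        return "day before"
    if day - timedelta(days=1) in holidays:
        return "day after"
    return "normal"


def calendar_features(day, holidays):
    """Categorical calendar labels for one date."""
    return {
        "day_of_week": DAY_NAMES[day.weekday()],
        "holiday": holiday_status(day, holidays),
    }
```

New Year's Eve and the millennium would be further indicator variables of the same kind.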

Page 82: Ysc2013

Fitted results (Summer 3pm)

[Figure: three panels showing the effect on demand at 3:00 pm against day of summer (0 to 150), day of week (Mon to Sun) and holiday status (Normal, Day before, Holiday, Day after).]

Page 89: Ysc2013

Monash Electricity Forecasting Model

log(yt) = hp(t) + fp(w1,t,w2,t) +

J∑j=1

cjzj,t + nt

fp(w1,t,w2,t) =6∑

k=0

[fk,p(xt−k) + gk,p(dt−k)

]+ qp(x+

t ) + rp(x−t ) + sp(x̄t)

+6∑j=1

[Fj,p(xt−48j) + Gj,p(dt−48j)

]xt is ave temp across two sites (Kent Town and AdelaideAirport) at time t;dt is the temp difference between two sites at time t;x+t is max of xt values in past 24 hours;x−t is min of xt values in past 24 hours;x̄t is ave temp in past seven days.

Each function is smooth & estimated using regression splines.Man vs Wild Data Going to extremes 31

Page 90: Ysc2013

Fitted results (Summer 3pm)

[Figure: eight panels, each plotting the effect on demand (vertical axis, −0.4 to 0.4) against one temperature predictor: Temperature, Lag 1 temperature, Lag 2 temperature, Lag 3 temperature, Lag 1 day temperature, Last week average temp, Previous max temp, Previous min temp. Time: 3:00 pm.]

Page 91: Ysc2013

Monash Electricity Forecasting Model

$$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$

Same predictors used for all 48 models.
Predictors chosen by cross-validation on summer of 2007/2008 and 2009/2010.
Each model is fitted to the data twice, first excluding the summer of 2009/2010 and then excluding the summer of 2010/2011.
The average out-of-sample MSE is calculated from the omitted data for the time periods 12 noon–8.30 pm.
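The hold-out scheme described on this slide (fit with one summer omitted, score on the omitted summer, average the scores) might look roughly like the following. It is a sketch, not the code behind the talk: the record layout, the `fit` and `predict` callables, and the mapping of "12 noon to 8.30 pm" onto half-hour indices 24 to 41 are all assumptions, and the sketch pools records rather than fitting a separate model per half-hour.

```python
def out_of_sample_mse(records, held_out_summers, fit, predict,
                      periods=range(24, 42)):
    """Average the hold-out MSE over the listed summers.

    records: dicts with keys "summer", "period" (0-47 half-hour of day) and "y".
    fit(train) returns a model; predict(model, record) returns a forecast of y.
    periods: half-hours scored; 24..41 is taken to mean 12 noon to 8.30 pm.
    """
    fold_mses = []
    for summer in held_out_summers:
        # Fit on everything except the held-out summer.
        train = [r for r in records if r["summer"] != summer]
        # Score only the held-out summer, and only inside the time window.
        test = [r for r in records
                if r["summer"] == summer and r["period"] in periods]
        model = fit(train)
        squared_errors = [(r["y"] - predict(model, r)) ** 2 for r in test]
        fold_mses.append(sum(squared_errors) / len(squared_errors))
    return sum(fold_mses) / len(fold_mses)


# Toy usage with a deliberately trivial "model": the training mean of y.
def fit_mean(train):
    return sum(r["y"] for r in train) / len(train)


def predict_mean(model, record):
    return model


toy = [
    {"summer": "A", "period": 30, "y": 1.0},
    {"summer": "A", "period": 30, "y": 3.0},
    {"summer": "A", "period": 5, "y": 7.0},   # outside the scored window
    {"summer": "B", "period": 30, "y": 2.0},
    {"summer": "B", "period": 30, "y": 2.0},
    {"summer": "C", "period": 30, "y": 5.0},
]
score = out_of_sample_mse(toy, ["A", "B"], fit_mean, predict_mean)
```

Comparing this score across candidate predictor sets is what drives the selection; a predictor is kept only if dropping it makes the out-of-sample MSE worse.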

Page 94: Ysc2013

Half-hourly models

Candidate predictors: x x1 x2 x3 x4 x5 x6 x48 x96 x144 x192 x240 x288 d d1 d2 d3 d4 d5 d6 d48 d96 d144 d192 d240 d288 x+ x− x̄ dow hol dos

[Table: one row per candidate model, with a • under each predictor included and the resulting MSE. The column positions of the • marks are not recoverable; the model numbers and MSE values are listed below.]

Model   MSE
1       1.037
2       1.034
3       1.031
4       1.027
5       1.025
6       1.020
7       1.025
8       1.026
9       1.035
10      1.044
11      1.057
12      1.076
13      1.102
14      1.018
15      1.021
16      1.037
17      1.074
18      1.152
19      1.180
20      1.021
21      1.027
22      1.038
23      1.056
24      1.086
25      1.135
26      1.009
27      1.063
28      1.028
29      3.523
30      2.143
31      1.523

Page 95: Ysc2013

Half-hourly models

[Figure: R-squared (%) against time of day, from 12 midnight to 12 midnight; vertical axis 60 to 90.]

Page 96: Ysc2013

Half-hourly models

[Figure: South Australian demand (January 2011). Actual and fitted demand (GW, 1.0 to 4.0) against date in January.]

[Figure: Temperatures (January 2011). Kent Town and Airport temperatures (deg C, 10 to 45) against date in January.]

Page 99: Ysc2013

Adjusted model

Original model

$$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$

Model allowing saturated usage

$$q_t = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$

$$\log(y_t) = \begin{cases} q_t & \text{if } q_t \le \tau; \\ \tau + k(q_t - \tau) & \text{if } q_t > \tau. \end{cases}$$
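The saturation adjustment is a piecewise-linear map on the log scale, which a short function makes concrete. The values of `tau` and `k` used in any call are hypothetical; the slides do not state how they were chosen, though a `k` between 0 and 1 is what would damp growth above the threshold.

```python
import math


def saturated_log_demand(q, tau, k):
    """Map the unadjusted value q_t to log(y_t).

    Below the threshold tau the model is unchanged; above it, each extra
    unit of q adds only k units to log demand.
    """
    if q <= tau:
        return q
    return tau + k * (q - tau)


def saturated_demand(q, tau, k):
    """Demand on the original scale, y_t = exp(log(y_t))."""
    return math.exp(saturated_log_demand(q, tau, k))
```

The two branches agree at `q == tau`, so the adjusted model is continuous at the threshold.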

Page 101: Ysc2013

Peak demand forecasting

$$q_{t,p} = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$

Multiple alternative futures created:
$h_p(t)$ known;
simulate future temperatures using double seasonal block bootstrap with variable blocks (with adjustment for climate change);
use assumed values for GSP, population and price;
resample residuals using double seasonal block bootstrap with variable blocks.
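To give a feel for the temperature simulation step, here is a much simplified seasonal block bootstrap. It is not the double seasonal scheme used in the talk: it only keeps whole days together (preserving the within-day pattern) and keeps each block at the same position in the season (preserving the within-season pattern), with block lengths drawn at random. The function name, the block-length bounds and the data layout are assumptions, and no climate-change adjustment is applied.

```python
import random


def seasonal_block_bootstrap(seasons, min_block_days=5, max_block_days=15,
                             rng=None):
    """Simulate one season by stitching together blocks of whole days.

    seasons: list of observed seasons; each season is a list of days and
    each day is a list of half-hourly values. All seasons must contain the
    same number of days.
    """
    rng = rng or random.Random()
    n_days = len(seasons[0])
    if any(len(season) != n_days for season in seasons):
        raise ValueError("all seasons must have the same number of days")

    simulated = []
    start = 0
    while start < n_days:
        # Variable block length, truncated at the end of the season.
        length = rng.randint(min_block_days, max_block_days)
        end = min(start + length, n_days)
        # Take the same calendar positions from a randomly chosen season.
        donor = rng.choice(seasons)
        simulated.extend(donor[start:end])
        start = end
    return simulated
```

Calling this many times yields many alternative temperature histories; the same idea, applied to the model residuals, supplies the noise term for each simulated future.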

Page 102: Ysc2013

Peak demand backcasting

$$q_{t,p} = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$

Multiple alternative pasts created:
$h_p(t)$ known;
simulate past temperatures using double seasonal block bootstrap with variable blocks;
use actual values for GSP, population and price;
resample residuals using double seasonal block bootstrap with variable blocks.

Page 103: Ysc2013

Peak demand backcasting

[Figure: PoE (annual interpretation). PoE demand (vertical axis, 2.0 to 4.0) against year, 98/99 to 10/11, showing the 10 %, 50 % and 90 % PoE levels with plotted points.]

Page 104: Ysc2013

Peak demand forecasting

[Figure: four panels of assumed inputs, 1990 to 2020, each with High, Base and Low scenarios: South Australia GSP (billion dollars, 08/09 dollars); South Australia population (million); Average electricity prices (c/kWh); Major industrial offset demand (MW).]

Page 105: Ysc2013

Peak demand distribution

[Figure: Annual POE levels. PoE demand (vertical axis, 2 to 6) against year, 98/99 to 20/21, showing the 1 %, 5 %, 10 %, 50 % and 90 % POE levels and the actual annual maximum.]
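A PoE (probability of exceedance) level is simply an upper-tail percentile of the simulated annual maximum: the 10 % PoE is the demand exceeded in one simulated year out of ten, that is, the 90th percentile. A minimal sketch of that conversion follows; the function name and the linear-interpolation rule for the empirical quantile are assumptions, not details given in the slides.

```python
def poe_level(annual_maxima, poe_percent):
    """Demand level exceeded in poe_percent % of the simulated years."""
    if not 0 < poe_percent < 100:
        raise ValueError("poe_percent must be strictly between 0 and 100")
    values = sorted(annual_maxima)
    # A p % PoE level is the (100 - p)th percentile of the annual maximum.
    position = (100 - poe_percent) / 100 * (len(values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(values) - 1)
    weight = position - lower
    # Linear interpolation between the two neighbouring order statistics.
    return values[lower] * (1 - weight) + values[upper] * weight


# Toy usage: 101 simulated annual maxima spread evenly from 2.0 to 3.0.
simulated_maxima = [2.0 + 0.01 * i for i in range(101)]
poe10 = poe_level(simulated_maxima, 10)
poe50 = poe_level(simulated_maxima, 50)
poe90 = poe_level(simulated_maxima, 90)
```

Note the inversion that tends to confuse clients: a smaller PoE percentage corresponds to a higher demand level.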

Page 106: Ysc2013

Results

We have successfully forecast the extreme upper tail in ten years' time using only twelve years of data!

This method has now been adopted for the official long-term peak electricity demand forecasts for all states except WA.

Some lessons

Cross-validation is very useful in prediction problems.

Statistical modelling is an iterative process.

Getting client understanding of percentiles is extremely difficult.

Beware of clients who think they know more than you!

Page 110: Ysc2013

Outline

1 Where fools fear to tread

2 Working with inadequate tools

3 When you can’t lose

4 Getting dirty with data

5 Going to extremes

6 Final thoughts


Page 111: Ysc2013

Crazy clients

The client who wouldn’t tell me the problem.

The client who wanted all meetings held at random locations for security reasons.

The client who didn’t like the answer.

Expert witnessing on the color purple (and now yellow).

Page 115: Ysc2013

Go forth and consult

A good statistician is not smarter than everyone else, he merely has his ignorance better organised.
(Anonymous)

Page 116: Ysc2013

Go forth and consult

All models are wrong, some are useful.

(George E P Box)


Page 118: Ysc2013

Go forth and consult

It is better to solve the right problem the wrong way than the wrong problem the right way.
(John W Tukey)

Slides available from robjhyndman.com