Young Statisticians Conference 7 February 2013
Outline
1 Where fools fear to tread
2 Working with inadequate tools
3 When you can’t lose
4 Getting dirty with data
5 Going to extremes
6 Final thoughts
Man vs Wild Data Where fools fear to tread 2
My story
Olympic video poker slots
Beware of smelly clients
Threats and slander
Nerves in court
Three university consulting services
Reviewing my own work
Six times an expert witness
Hundreds of clients
2 Working with inadequate tools
Disposable tableware company
Problem: Want forecasts of each of hundreds of items. Series can be stationary, trended or seasonal. They currently have a large forecasting program written in-house but it doesn’t seem to produce sensible forecasts. They want me to tell them what is wrong and fix it.
Additional information
Program written in COBOL, making numerical calculations limited. It is not possible to do any optimisation.
Their programmer has little experience in numerical computing.
They employ no statisticians and want the program to produce forecasts automatically.
Methods currently used
A 12 month average
C 6 month average
E straight line regression over last 12 months
G straight line regression over last 6 months
H average slope between last year’s and this year’s values. (Equivalent to differencing at lag 12 and taking mean.)
I Same as H except over 6 months.
K I couldn’t understand the explanation.
My solution
Use first differencing to deal with trend, or seasonal differencing to deal with seasonality.
Use simple exponential smoothing (SES) on the (differenced) data with the parameter selected from {0.1, 0.3, 0.5, 0.7, 0.9}.
For each series, try 15 models: three differencing options (no differencing, first differencing, seasonal differencing), each combined with SES at 5 parameter values.
Model selected based on smallest MSE. (Only one parameter for each model, so no need to penalize for model size.)
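The 15-model search above can be sketched in a few lines. This is a minimal illustration, not the client's COBOL implementation: the function names, the choice to initialise the level at the first value, and the seasonal period of 12 are all assumptions made for the example.

```python
import numpy as np

ALPHAS = (0.1, 0.3, 0.5, 0.7, 0.9)

def ses_mse(z, alpha):
    """One-step-ahead MSE of simple exponential smoothing on series z."""
    level = z[0]                      # assumption: start the level at the first value
    errors = []
    for value in z[1:]:
        errors.append(value - level)  # the forecast for this step is the current level
        level = alpha * value + (1 - alpha) * level
    return float(np.mean(np.square(errors)))

def select_model(y, period=12):
    """Try 3 differencing options x 5 alphas; return (mse, differencing, alpha)."""
    y = np.asarray(y, dtype=float)
    candidates = {
        "none": y,
        "first": y[1:] - y[:-1],
        "seasonal": y[period:] - y[:-period],
    }
    results = [
        (ses_mse(z, alpha), name, alpha)
        for name, z in candidates.items()
        for alpha in ALPHAS
    ]
    return min(results)               # smallest MSE wins; no optimiser needed

# Usage: a purely seasonal series is best handled by seasonal differencing.
pattern = [5, 3, 8, 1, 9, 2, 7, 4, 6, 10, 12, 11]
best = select_model(pattern * 4)
```

A one-step forecast error on the differenced scale equals the error on the original scale, so the MSEs are comparable across the three options, although each is averaged over a slightly different number of observations. Only arithmetic and comparisons are needed, which is what makes the approach feasible in an environment without optimisation routines.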
Some lessons
Be pragmatic.
Understand your tools well enough to be able to adapt them.
A successful consulting job often uses very simple methods.
3 When you can’t lose
Forecasting the PBS
The Pharmaceutical Benefits Scheme (PBS) is the Australian government drugs subsidy scheme.
Many drugs bought from pharmacies are subsidised to allow more equitable access to modern drugs.
The cost to government is determined by the number and types of drugs purchased. Currently nearly 1% of GDP.
The total cost is budgeted based on forecasts of drug usage.
In 2001: $4.5 billion budget, under-forecasted by $800 million.
Thousands of products. Seasonal demand.
Subject to covert marketing, volatile products, uncontrollable expenditure.
Although monthly data available for 10 years, data are aggregated to annual values, and only the first three years are used in estimating the forecasts.
All forecasts being done with the FORECAST function in MS-Excel!
ATC drug classification
A Alimentary tract and metabolism
B Blood and blood forming organs
C Cardiovascular system
D Dermatologicals
G Genito-urinary system and sex hormones
H Systemic hormonal preparations, excluding sex hormones and insulins
J Anti-infectives for systemic use
L Antineoplastic and immunomodulating agents
M Musculo-skeletal system
N Nervous system
P Antiparasitic products, insecticides and repellents
R Respiratory system
S Sensory organs
V Various
ATC drug classification
A Alimentary tract and metabolism (14 classes at this level)
A10 Drugs used in diabetes (84 classes at this level)
A10B Blood glucose lowering drugs
A10BA Biguanides
A10BA02 Metformin
Forecasting the PBS
Monthly data on thousands of drug groups and 4 concession types available from 1991.
Method needs to be automated and implemented within MS-Excel.
Exponential smoothing seems appropriate (monthly data with changing trends and seasonal patterns), but in 2001, automated exponential smoothing was not well-developed, and not available in MS-Excel.
As part of this project, we developed an automatic forecasting algorithm for exponential smoothing state space models based on the AIC.
Forecast MAPE reduced from 15–20% to about 0.6%.
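To illustrate the idea of choosing among exponential smoothing models by AIC, here is a toy sketch. It is not the algorithm developed in the project. It compares only two forms (simple exponential smoothing and Holt's linear trend), picks parameters from a coarse grid instead of maximising a likelihood, uses heuristic initial states, and computes a Gaussian-style AIC as n log(SSE/n) + 2k. All names and those simplifications are assumptions made for the example.

```python
import itertools
import math

GRID = [0.1, 0.3, 0.5, 0.7, 0.9]

def sse_ses(y, alpha):
    """Sum of squared one-step errors for simple exponential smoothing."""
    level, sse = y[0], 0.0
    for obs in y[1:]:
        err = obs - level
        sse += err ** 2
        level += alpha * err
    return sse

def sse_holt(y, alpha, beta):
    """Sum of squared one-step errors for Holt's linear trend method."""
    level, trend, sse = y[0], y[1] - y[0], 0.0   # heuristic initial states
    for obs in y[1:]:
        err = obs - (level + trend)
        sse += err ** 2
        level = level + trend + alpha * err      # error-correction form
        trend = trend + alpha * beta * err
    return sse

def aic(sse, n, k):
    """Gaussian-style AIC; the floor avoids log(0) for a perfect fit."""
    return n * math.log(max(sse, 1e-12) / n) + 2 * k

def select_by_aic(y):
    """Return (aic, model name, parameters) for the lowest-AIC candidate."""
    n = len(y) - 1                               # number of one-step errors
    fits = [("SES", (a,), sse_ses(y, a), 2) for a in GRID]
    fits += [("Holt", (a, b), sse_holt(y, a, b), 4)
             for a, b in itertools.product(GRID, GRID)]
    return min((aic(sse, n, k), name, params) for name, params, sse, k in fits)

# Usage: a series with a steady trend should favour the trended model.
score, name, params = select_by_aic([5.0 + 2.0 * t for t in range(30)])
```

The penalty term 2k is what lets models of different size be compared on the same series; here k counts the smoothing parameters plus the initial states (2 for SES, 4 for Holt), which is one common convention.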
[Figures: Total cost ($ thousands), 1995 to 2010, for the A03 concession safety net group, the A05 general copayments group, the D01 general copayments group, the S01 general copayments group, and the R03 general copayments group.]
Some lessons
Often what people do is very bad, and it is easy to make a big difference.
Sometimes you have to invent new methods, and that can lead to publications.
You have to implement solutions in the client’s software environment.
Be aware of the politics.
4 Getting dirty with data
Airline passenger traffic
[Figure: passengers, Melbourne−Sydney, 1988 to 1993, in three panels: first class, business class, economy class.]
Not the real data! Or is it?
[Figure: Economy class passengers (thousands), Melbourne−Sydney, 1988 to 1993.]
Possible model
$$Y_t = Y_t^* + Z_t$$
$$Y_t^* = \beta_0 + \sum_j \beta_j x_{t,j} + N_t$$
$Y_t$ = observed data for one passenger class.
$Y_t^*$ = reconstructed data.
$Z_t$ = latent process (usually equal to zero).
$x_{t,j}$ are covariates and dummy variables.
$N_t$ = seasonal ARIMA process of period 52.
Some lessons
Real data is often very messy. Be aware of the causes.
Get an answer even if it isn’t pretty.
What to do with the non-integer seasonality? (average 52.19)
How to deal with the correlations between classes and between routes?
You often think of better approaches long after the project is finished.
5 Going to extremes
Extreme electricity demand
The problem
We want to forecast the peak electricity demand in a half-hour period in ten years’ time.
We have twelve years of half-hourly electricity data, temperature data and some economic and demographic data.
The location is South Australia: home to the most volatile electricity demand in the world.
Sounds impossible?
South Australian demand data
[Figure: South Australian demand data, with Black Saturday marked.]
[Figure: South Australia state-wide demand (GW), summer 2010/11, October 2010 to March 2011.]
[Figure: South Australia state-wide demand (GW), January 2011, by date in January.]
Demand boxplots (Sth Aust)
Temperature data (Sth Aust)
Monash Electricity Forecasting Model
$$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
$y_t$ denotes per capita demand at time $t$ (measured in half-hourly intervals) and $p$ denotes the time of day, $p = 1, \dots, 48$;
$h_p(t)$ models all calendar effects;
$f_p(w_{1,t}, w_{2,t})$ models all temperature effects, where $w_{1,t}$ is a vector of recent temperatures at location 1 and $w_{2,t}$ is a vector of recent temperatures at location 2;
$z_{j,t}$ is a demographic or economic variable at time $t$;
$n_t$ denotes the model error at time $t$.
$h_p(t)$ handles annual, weekly and daily seasonal patterns as well as public holidays:
$$h_p(t) = \ell_p(t) + \alpha_{t,p} + \beta_{t,p} + \gamma_{t,p} + \delta_{t,p}$$
$\ell_p(t)$ is the “time of summer” effect (a regression spline);
$\alpha_{t,p}$ is the day of week effect;
$\beta_{t,p}$ is the “holiday” effect;
$\gamma_{t,p}$ is the New Year’s Eve effect;
$\delta_{t,p}$ is the millennium effect.
Fitted results (Summer 3pm)
[Figure: estimated effect on demand at 3:00 pm of day of summer, day of week, and holiday (normal, day before, holiday, day after).]
$$f_p(w_{1,t}, w_{2,t}) = \sum_{k=0}^{6} \left[ f_{k,p}(x_{t-k}) + g_{k,p}(d_{t-k}) \right] + q_p(x_t^+) + r_p(x_t^-) + s_p(\bar{x}_t) + \sum_{j=1}^{6} \left[ F_{j,p}(x_{t-48j}) + G_{j,p}(d_{t-48j}) \right]$$
$x_t$ is the average temperature across two sites (Kent Town and Adelaide Airport) at time $t$;
$d_t$ is the temperature difference between the two sites at time $t$;
$x_t^+$ is the maximum of the $x_t$ values in the past 24 hours;
$x_t^-$ is the minimum of the $x_t$ values in the past 24 hours;
$\bar{x}_t$ is the average temperature in the past seven days.
Each function is smooth and estimated using regression splines.
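The inputs to these smooth functions are simple transformations of the two temperature series. The sketch below builds them for a single half-hour index; it only constructs the predictors and does not fit any splines. The function name, the sign convention for the difference (site 1 minus site 2), and the choice to include the current half-hour in the 24-hour and seven-day windows are assumptions made for illustration.

```python
import numpy as np

def temperature_features(temp1, temp2, t):
    """Build the temperature predictors for half-hour index t.

    temp1, temp2: half-hourly temperatures at the two sites.
    Requires t >= 335 so that seven days (336 half-hours) of history exist.
    """
    temp1 = np.asarray(temp1, dtype=float)
    temp2 = np.asarray(temp2, dtype=float)
    x = (temp1 + temp2) / 2           # average temperature across the two sites
    d = temp1 - temp2                 # assumption: difference taken as site 1 minus site 2
    past_day = x[t - 47 : t + 1]      # 48 half-hours ending at t
    past_week = x[t - 335 : t + 1]    # 336 half-hours ending at t
    return {
        "x_lags": [x[t - k] for k in range(7)],              # k = 0..6 half-hours
        "d_lags": [d[t - k] for k in range(7)],
        "x_day_lags": [x[t - 48 * j] for j in range(1, 7)],  # j = 1..6 days
        "d_day_lags": [d[t - 48 * j] for j in range(1, 7)],
        "x_max": float(past_day.max()),
        "x_min": float(past_day.min()),
        "x_bar": float(past_week.mean()),
    }

# Usage with synthetic temperatures that rise by one unit per half-hour.
site1 = np.arange(400.0)
site2 = np.arange(400.0) - 2.0
features = temperature_features(site1, site2, t=350)
```

Each entry of the returned dictionary corresponds to the argument of one term in the formula: the half-hourly lags feed the first sum, the daily lags (multiples of 48 half-hours) feed the second, and the three summaries feed the max, min and seven-day average terms.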
Fitted results (Summer 3pm)
[Figure: estimated effect on demand at 3:00 pm of temperature, lag 1 temperature, lag 2 temperature, lag 3 temperature, lag 1 day temperature, last week average temperature, previous max temperature, and previous min temperature.]
Monash Electricity Forecasting Model
\[
\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t
\]
Same predictors used for all 48 models.
Predictors chosen by cross-validation on summer of 2007/2008 and 2009/2010.
Each model is fitted to the data twice, first excluding the summer of 2009/2010 and then excluding the summer of 2010/2011. The average out-of-sample MSE is calculated from the omitted data for the time periods 12 noon–8.30pm.
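The hold-out procedure described on this slide can be sketched as follows. This is a minimal illustration, not the code used for the model: `fit` and `predict` are hypothetical hooks standing in for the real spline model, and the caller is assumed to have already restricted the held-out rows to the 12 noon–8.30pm periods.

```python
def average_out_of_sample_mse(data_by_summer, holdout_summers, fit, predict):
    """Fit once per held-out summer and average the out-of-sample MSEs.

    data_by_summer: dict mapping a summer label to a list of (features, target) pairs.
    fit(train_rows) -> model and predict(model, features) -> forecast are
    hypothetical hooks supplied by the caller.
    """
    mses = []
    for held_out in holdout_summers:
        # Train on every summer except the one being held out.
        train = [row
                 for label, rows in data_by_summer.items()
                 if label != held_out
                 for row in rows]
        model = fit(train)
        test = data_by_summer[held_out]
        squared_errors = [(target - predict(model, features)) ** 2
                          for features, target in test]
        mses.append(sum(squared_errors) / len(squared_errors))
    return sum(mses) / len(mses)


# Toy stand-in for the real model: always forecast the training mean.
def fit_mean(rows):
    return sum(target for _, target in rows) / len(rows)


def predict_mean(model, features):
    return model
```

Candidate predictor sets would then be compared by this average MSE, with lower values preferred.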
Half-hourly models
Candidate predictors (table columns): x, x1, x2, x3, x4, x5, x6, x48, x96, x144, x192, x240, x288, d, d1, d2, d3, d4, d5, d6, d48, d96, d144, d192, d240, d288, x+, x−, x̄, dow, hol, dos.
[The • marks showing which predictors each model includes lost their column alignment in extraction and are not reproduced.]
Model  MSE
1      1.037
2      1.034
3      1.031
4      1.027
5      1.025
6      1.020
7      1.025
8      1.026
9      1.035
10     1.044
11     1.057
12     1.076
13     1.102
14     1.018
15     1.021
16     1.037
17     1.074
18     1.152
19     1.180
20     1.021
21     1.027
22     1.038
23     1.056
24     1.086
25     1.135
26     1.009
27     1.063
28     1.028
29     3.523
30     2.143
31     1.523
Half-hourly models
[Figure: R-squared (%) by time of day, from 12 midnight to 12 midnight; vertical axis 60 to 90.]
[Figure: South Australian demand (January 2011); actual and fitted demand (GW, 1.0 to 4.0) by date in January.]
[Figure: Temperatures (January 2011); temperature (deg C, 10 to 45) at Kent Town and Airport by date in January.]
Adjusted model
Original model
\[
\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t
\]
Model allowing saturated usage
\[
q_t = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t
\]
\[
\log(y_t) =
\begin{cases}
q_t & \text{if } q_t \le \tau; \\
\tau + k(q_t - \tau) & \text{if } q_t > \tau.
\end{cases}
\]
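The piecewise adjustment above is easy to check numerically. The sketch below is an illustration only: the function names are hypothetical, and the values of τ and k in the usage note are made up. Taking k between 0 and 1 is an assumption, under which demand grows more slowly once q_t passes the threshold.

```python
import math


def saturated_log_demand(q, tau, k):
    """Return log(y_t): equal to q up to tau, then rising with slope k above it."""
    if q <= tau:
        return q
    return tau + k * (q - tau)


def saturated_demand(q, tau, k):
    """Return y_t on the original scale."""
    return math.exp(saturated_log_demand(q, tau, k))
```

For example, with the made-up values tau = 1.2 and k = 0.5, an input of q = 1.4 maps to a log demand of about 1.3 rather than 1.4, and the two branches agree at q = tau, so the adjustment is continuous.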
Peak demand forecasting
\[
q_{t,p} = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t
\]
Multiple alternative futures created:
$h_p(t)$ known;
simulate future temperatures using double seasonal block bootstrap with variable blocks (with adjustment for climate change);
use assumed values for GSP, population and price;
resample residuals using double seasonal block bootstrap with variable blocks.
Peak demand backcasting
\[
q_{t,p} = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t
\]
Multiple alternative pasts created:
$h_p(t)$ known;
simulate past temperatures using double seasonal block bootstrap with variable blocks;
use actual values for GSP, population and price;
resample residuals using double seasonal block bootstrap with variable blocks.
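To give a feel for the resampling step, here is a simplified Python sketch of a seasonal block bootstrap with variable block lengths. It is not the exact algorithm behind these slides: the function name, the block-length range and the data layout are all assumptions. The idea illustrated is that whole days are kept intact (preserving the within-day pattern) and each block stays at its own position in the season (preserving the time-of-year pattern), while the donor year changes from block to block.

```python
import random


def seasonal_block_bootstrap(years, min_block=5, max_block=20, rng=None):
    """Build one synthetic season from blocks of whole days.

    years: list of seasons; each season is a list of days; each day is a list
    of half-hourly values. All seasons must have the same number of days.
    min_block, max_block: assumed range for the random block length in days.
    """
    rng = rng or random.Random()
    n_days = len(years[0])
    if any(len(season) != n_days for season in years):
        raise ValueError("all seasons must have the same number of days")
    synthetic = []
    start = 0
    while start < n_days:
        length = rng.randint(min_block, max_block)  # variable block length
        end = min(start + length, n_days)
        donor = rng.choice(years)                   # one donor year per block
        # Copy the donor's days for the same position in the season.
        synthetic.extend(list(day) for day in donor[start:end])
        start = end
    return synthetic
```

Calling the function repeatedly yields the "multiple alternative" temperature or residual paths that the slide refers to.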
Peak demand backcasting
[Figure: PoE (annual interpretation); PoE demand (2.0 to 4.0) by year, 98/99 to 10/11, showing the 10 %, 50 % and 90 % levels together with plotted points for individual years.]
Peak demand forecasting
[Figure: four panels of driver variables by year, 1990 to 2020, each with High, Base and Low scenarios. Panels: South Australia GSP (billion dollars, 08/09 dollars; 40 to 120); South Australia population (million; 1.4 to 2.0); Average electricity prices (c/kWh; 12 to 22); Major industrial offset demand (MW; 0 to 400).]
Peak demand distribution
[Figure: Annual POE levels; PoE demand (2 to 6) by year, 98/99 to 20/21, showing the 1 %, 5 %, 10 %, 50 % and 90 % POE levels and the actual annual maximum.]
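A probability-of-exceedance (PoE) level can be read off a set of simulated annual maxima as an upper quantile: the p % PoE level is the value exceeded in about p % of simulated years, that is, the (100 − p)th percentile. The sketch below assumes the simulated maxima are already available as a list. The function name and the linear-interpolation quantile rule are illustrative choices, not taken from the slides.

```python
def poe_level(annual_maxima, poe_percent):
    """Return the demand level exceeded in roughly poe_percent % of simulated years."""
    if not 0 < poe_percent < 100:
        raise ValueError("poe_percent must be strictly between 0 and 100")
    ordered = sorted(annual_maxima)
    # The p % PoE level is the (100 - p)th percentile of the annual maxima.
    position = (1 - poe_percent / 100) * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    # Linear interpolation between neighbouring order statistics.
    return ordered[lower] * (1 - weight) + ordered[upper] * weight
```

So the 10 % PoE level sits well above the 50 % level, and the 90 % level sits below it.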
Results
We have successfully forecast the extreme upper tail in ten years' time using only twelve years of data!
This method has now been adopted for the official long-term peak electricity demand forecasts for all states except WA.
Some lessons
Cross-validation is very useful in prediction problems.
Statistical modelling is an iterative process.
Getting client understanding of percentiles is extremely difficult.
Beware of clients who think they know more than you!
Outline
1 Where fools fear to tread
2 Working with inadequate tools
3 When you can’t lose
4 Getting dirty with data
5 Going to extremes
6 Final thoughts
Crazy clients
The client who wouldn’t tell me the problem.
The client who wanted all meetings held at random locations for security reasons.
The client who didn’t like the answer.
Expert witnessing on the color purple (and now yellow).
Go forth and consult
A good statistician is not smarter than everyone else, he merely has his ignorance better organised.
(Anonymous)
All models are wrong, some are useful.
(George E P Box)
It is better to solve the right problem the wrong way than the wrong problem the right way.
(John W Tukey)
Slides available from robjhyndman.com