Statistical model building Marian Scott and Ron Smith Dept of Statistics, University of Glasgow, CEH Glasgow, Aug 2008
Mar 28, 2015
Outline of presentation
• Statistical models: what are the principles?
  – describing variation
  – empiricism
• Fitting models: calibration
• Testing models: validation or verification
• Quantifying and apportioning variation in model and data
• Stochastic and deterministic models
• Introduction to uncertainty and sensitivity analysis
Step 1
why do you want to build a model- what is your objective?
what data are available and how were they collected?
is there a natural response or outcome and other explanatory variables or covariates?
Modelling objectives
• explore relationships
• make predictions
• improve understanding
• test hypotheses
[Diagram: conceptual system, data, model and policy, linked by inputs & parameters, model results and feedbacks.]
Why model?
Purposes of modelling:
– Describe/summarise
– Predict (what if…)
– Test hypotheses
– Manage
What is a good model?
– Simple, realistic, efficient, reliable, valid
Value judgements
• Different criteria are of unequal importance.
• The key comparison is often with observational data, but such comparisons must include the model uncertainties and the uncertainties on the observational data.
Questions we ask about models
• Is the model valid? Are the assumptions reasonable? Does the model make sense based on best scientific knowledge?
• Is the model credible? Do the model predictions match the observed data?
• How uncertain are the results?
Stages in modelling
Design and conceptualisation:
– Visualisation of structure
– Identification of processes
– Choice of parameterisation
Fitting and assessment:
– Parameter estimation (calibration)
– Goodness of fit
a visual model- atmospheric flux of pollutants
•Atmospheric pollutants dispersed over Europe
•In the 1970s, considerable environmental damage was caused by acid rain
•International action
•Development of EMEP programme, models and measurements
The mathematical flux model
L: Monin-Obukhov length
u*: Friction velocity of wind
cp: constant (=1.01)
ρ: constant (=1246 g m⁻³)
T: air temperature (in Kelvin)
k: constant (=0.41)
g: acceleration due to gravity (=9.81 m/s²)
H: the rate of heat transfer per unit area
gasht: Current height that measurements are taken at.
d: zero plane displacement
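The equation on this slide did not survive extraction. For orientation only, the standard textbook definition of the Monin-Obukhov length in terms of the quantities listed above is sketched below. It writes ρ for the density-like constant given as 1246 g m⁻³, which is an assumption about the missing symbol, and it may not match the exact form shown on the original slide (which also involved gasht and d).

```latex
L = -\frac{\rho \, c_p \, T \, u_*^{3}}{k \, g \, H}
```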
what would a statistician do if confronted with this problem?
• Look at the data
• Understand the measurement processes
• Think about how the scientific knowledge and conceptual model relate to what we have measured
Step 2- understand your data
• study your data
• learn its properties
• tools: graphical
The data- variation
Soil or sediment samples taken side by side, samples from different parts of the same plant, or samples from different animals in the same environment exhibit different activity densities of a given radionuclide.
The distribution of values observed will provide an estimate of the variability inherent in the population of samples that, theoretically, could be taken.
[Figure: Normal histogram of log activity. x-axis: Data (3.5 to 6.0); y-axis: Frequency (0 to 20).]
Variable    Mean    StDev    N
alllogt     4.647   0.3815   59
logt2007    4.704   0.6001   14
Activity (log10) of particles (Bq Cs-137) with Normal or Gaussian density superimposed
Variation
Measured atmospheric fluxes for 1997
•Measured fluxes for 1997 are still noisy.
•Is there a statistical signal and at what timescale?
[Figure: 1997 fluxes (0 to 15) plotted against index (100 to 300).]
Key properties of any measurement
Accuracy refers to the deviation of the measurement from the ‘true’ value
Precision refers to the variation in a series of replicate measurements (obtained under identical conditions)
[Figure: two panels, labelled ‘Accurate, imprecise’ and ‘Inaccurate, precise’.]
Accuracy and precision
Evaluation of accuracy
In a laboratory inter-comparison, known-concentration material is used to define the ‘true’ concentration
The figure shows a measure of accuracy for individual laboratories
Accuracy is linked to Bias
[Figure: offset (years BP), from -600 to 500, plotted against laboratory identifier, from 0 to 100.]
Evaluation of precision
Analysis of the instrumentation method to make a single measurement, and the propagation of any errors
Repeat measurements (true replicates) – using homogeneous material, repeatedly subsampling, etc….
Precision is linked to Variance (standard deviation)
The nature of measurement
All measurement is subject to uncertainty.
Analytical uncertainty reflects that every time a measurement is made (under identical conditions), the result is different.
Sampling uncertainty represents the ‘natural’ variation in the organism within the environment.
The error and uncertainty in a measurement
The error is a single value, which represents the difference between the measured value and the true value
The uncertainty is a range of values, and describes the errors which might have been observed were the measurement repeated under IDENTICAL conditions
Error (and uncertainty) includes a combination of variance and bias
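The link between error, bias and variance can be illustrated with a small simulation. This is a minimal sketch: the true value, the bias and the replicate standard deviation are all hypothetical numbers chosen for illustration, not values from the presentation.

```python
import numpy as np

rng = np.random.default_rng(1)

true_value = 10.0   # hypothetical 'true' value of the quantity
bias = 0.5          # hypothetical systematic offset of the method
sd = 0.2            # hypothetical replicate standard deviation

# Replicate measurements made under identical conditions
measurements = true_value + bias + rng.normal(0.0, sd, size=1000)

errors = measurements - true_value   # one error per measurement
est_bias = errors.mean()             # accuracy: the systematic part
est_var = errors.var()               # precision: the random part
mse = np.mean(errors ** 2)           # overall mean squared error

# The mean squared error splits into bias^2 + variance
print(est_bias, est_var, mse)
```

The mean squared error equals the squared bias plus the variance, which is one way to read the statement that error combines both.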
Effect of uncertainties
Lack of observations contributes to
– uncertainties in input data
– uncertainty in model parameter values
Conflicting evidence contributes to
– uncertainty about model form
– uncertainty about validity of assumptions
Step 3- build the statistical model
Outcomes or Responses: these are the results of the practical work and are sometimes referred to as ‘dependent variables’.
Causes or Explanations: these are the conditions or environment within which the outcomes or responses have been observed and are sometimes referred to as ‘independent variables’, but more commonly known as covariates.
Statistical models
In experiments many of the covariates have been determined by the experimenter but some may be aspects that the experimenter has no control over but that are relevant to the outcomes or responses.
In observational studies, these are usually not under the control of the experimenter but are recorded as possible explanations of the outcomes or responses.
Specifying a statistical model
Models specify the way in which outcomes and causes link together, e.g.
Metabolite = Temperature
The = sign does not indicate equality in a mathematical sense, and there should be an additional item on the right-hand side, giving the formula:
Metabolite = Temperature + Error
Specifying a statistical model
Metabolite = Temperature + Error
In mathematical terms, there will be some unknown parameters to be estimated, and some assumptions will be made about the error distribution:
$\text{Metabolite} = \alpha + \beta \cdot \text{Temperature} + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$
statistical model interpretation
Metabolite = Temperature + Error
The outcome Metabolite is explained by Temperature and other things that we have not recorded which we call Error.
The task that we then have in terms of data analysis is simply to find out if the effect that Temperature has is ‘large’ in comparison to that which Error has so that we can say whether or not the Metabolite that we observe is explained by Temperature.
Model calibration
Statisticians tend to talk about model fitting; calibration means something else to them.
Methods: least squares or maximum likelihood
Least squares: find the parameter estimates that minimise the sum of squares (SS)
$SS = \sum (\text{observed } y - \text{model fitted } y)^2$
Maximum likelihood: find the parameter estimates that maximise the likelihood of the data
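The two fitting methods can be sketched for a straight-line model with Normal errors. The data below are simulated and hypothetical, and the names `alpha_hat`, `beta_hat` and `sigma2_mle` are introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: response y depends linearly on covariate x plus Normal error
x = np.linspace(5.0, 25.0, 50)
y = 2.0 + 0.3 * x + rng.normal(0.0, 0.5, size=x.size)

# Least squares: choose (alpha, beta) to minimise SS = sum((y - fitted)^2)
X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, beta_hat = coef
fitted = X @ coef
ss = np.sum((y - fitted) ** 2)

# Maximum likelihood under Normal errors gives the same alpha and beta,
# with the error variance estimated by SS / n
sigma2_mle = ss / y.size
log_lik = -0.5 * y.size * (np.log(2 * np.pi * sigma2_mle) + 1.0)

print(alpha_hat, beta_hat, sigma2_mle, log_lik)
```

Under the Normal error assumption the two methods agree on the regression parameters; they differ only in how the error variance is estimated (maximum likelihood divides by n rather than by the residual degrees of freedom).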
Calibration-using the data
• A good idea, if possible, is to have a training set and a test set of data: split the data (90%/10%).
• Fit the model using the training set, evaluate the model using the test set.
• Why? Because if we assess how well the model performs on the data that were used to fit it, then we are being over-optimistic.
• Other methods: bootstrap and jackknife
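A 90%/10% split can be sketched as follows. The data are simulated and hypothetical, and a straight-line model stands in for whatever model is actually being fitted.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data set
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

# Random 90% / 10% split into training and test sets
idx = rng.permutation(n)
n_train = int(0.9 * n)
train, test = idx[:n_train], idx[n_train:]

# Fit a straight line on the training set only
beta, alpha = np.polyfit(x[train], y[train], deg=1)

def rmse(i):
    """Root mean squared prediction error on the rows indexed by i."""
    return np.sqrt(np.mean((y[i] - (alpha + beta * x[i])) ** 2))

rmse_train, rmse_test = rmse(train), rmse(test)
print(rmse_train, rmse_test)
```

The test-set error is the more honest measure of predictive performance, because those observations played no part in estimating the parameters.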
Model validation
What is validation?
• Fit the model using the training set, evaluate the model using the test set.
• Why? Because if we assess how well the model performs on the data that were used to fit it, then we are being over-optimistic.
• Other methods: bootstrap and jackknife
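The bootstrap and jackknife both reuse the data by refitting the model on modified copies of it. The sketch below uses them to estimate the standard error of a fitted slope; the data are hypothetical and the slope is just one convenient quantity to illustrate the resampling.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data
n = 40
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    return np.polyfit(xs, ys, deg=1)[0]

# Bootstrap: refit on data sets resampled with replacement
boot = np.empty(500)
for b in range(boot.size):
    i = rng.integers(0, n, size=n)
    boot[b] = slope(x[i], y[i])
se_boot = boot.std(ddof=1)

# Jackknife: refit leaving out one observation at a time
jack = np.array([slope(np.delete(x, i), np.delete(y, i)) for i in range(n)])
se_jack = np.sqrt((n - 1) / n * np.sum((jack - jack.mean()) ** 2))

print(se_boot, se_jack)
```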
Example 4: Models- how well should models agree?
6 ocean models (process based-transport, sedimentary processes, numerical solution scheme, grid size) used to predict the dispersal of a pollutant
Results to be used to determine a remediation policy for an illegal dumping of “radioactive waste”: a ‘what if’ scenario investigation
The models differ in their detail and also in their spatial scale
Predictions of levels of cobalt-60
Different models, same input data
Predictions vary by considerable margins
Magnitude of variation a function of spatial distribution of sites
[Figure: five panels of CV (%) against simulation condition (tiw, tis, tcw, tcs, biw, bis, bcw, bcs), one each for locations 7, 8, 9, 10 and 11; CV axis marked at 50, 150 and 250.]
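The coefficient of variation summarises how much predictions from different models disagree at one location. The sketch below uses made-up predictions for six models at three locations; none of these numbers come from the study.

```python
import numpy as np

# Hypothetical predictions of cobalt-60 level (arbitrary units):
# rows are 6 models, columns are 3 locations
predictions = np.array([
    [1.2, 0.8, 0.05],
    [0.9, 1.1, 0.20],
    [1.5, 0.7, 0.02],
    [1.1, 0.9, 0.40],
    [0.8, 1.3, 0.10],
    [1.3, 1.0, 0.01],
])

mean = predictions.mean(axis=0)
sd = predictions.std(axis=0, ddof=1)   # sample standard deviation across models
cv_percent = 100.0 * sd / mean         # coefficient of variation per location

print(cv_percent)
```

In this invented example the third location, where predicted levels are low, shows a much larger CV than the others.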
Statistical models and process models
Loch Leven, modelling nutrients
• process model based on differential equations
• statistical model based on empirically determined relationships
[Figure: six time-series panels against year (1970 to 2000): Log SRP (μg/l), Log TP (μg/l), Log Secchi (metres), Log Daphnia (individuals/l), Log Chlorophyll (μg/l) and Water Temperature (°C).]
Loch Leven
[Figure: six time-series panels against year (1970 to 2000): Log SRP (μg/l), Log TP (μg/l), Log Secchi (metres), Log Daphnia (individuals/l), Log Chlorophyll (μg/l) and Water Temperature (°C).]
Loch Leven
[Figure: six panels against month (1 to 12): Log SRP (μg/l), Log TP (μg/l), Log Chlorophyll (μg/l), Log Daphnia (individuals/l), Log Secchi (metres) and Water Temperature (°C).]
Uncertainty and sensitivity analysis
• Uncertainty (in variables, models, parameters, data)
• What are uncertainty and sensitivity analyses?
• An example.
Effect of uncertainties
Lack of observations contributes to
– uncertainties in input data
– uncertainty in model parameter values
Conflicting evidence contributes to
– uncertainty about model form
– uncertainty about validity of assumptions
Modelling tools - SA/UA
Sensitivity analysis
determining the amount and kind of change produced in the model predictions by a change in a model parameter
Uncertainty analysis
an assessment/quantification of the uncertainties associated with the parameters, the data and the model structure.
Modellers conduct SA to determine
(a) if a model resembles the system or processes under study,
(b) the factors that contribute most to the output variability,
(c) the model parameters (or parts of the model itself) that are insignificant,
(d) if there is some region in the space of input factors for which the model variation is maximum, and
(e) if and which (groups of) factors interact with each other.
SA flow chart (Saltelli, Chan and Scott, 2000)
Design of the SA experiment
• Simple factorial designs (one at a time)
• Factorial designs (including potential interaction terms)
• Fractional factorial designs
• Important difference: in the design of computer code experiments, random variation due to variation in experimental units does not exist.
SA techniques
Screening techniques
– O(ne) A(t) a T(ime), factorial and fractional factorial designs used to isolate a set of important factors
Local/differential analysis
Sampling-based (Monte Carlo) methods
Variance-based methods
– variance decomposition of output to compute sensitivity indices
Screening
screening experiments can be used to identify the parameter subset that controls most of the output variability with low computational effort.
Screening methods
Vary one factor at a time (NOT particularly recommended)
Morris OAT design (global)
– Estimate the main effect of a factor by computing a number r of local measures at different points x1,…,xr in the input space and then averaging them.
– Order the input factors
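The idea of averaging r local measures can be sketched as below. This is a simplified one-at-a-time scheme around random base points, not the full Morris trajectory design, and the test model, the number of points r and the step `delta` are all hypothetical choices.

```python
import numpy as np

def model(x):
    """Hypothetical test model with three inputs on [0, 1]."""
    return 4.0 * x[0] + x[1] ** 2 + 0.1 * x[2]

rng = np.random.default_rng(3)
k, r, delta = 3, 20, 0.1

effects = np.empty((r, k))
for t in range(r):
    base = rng.uniform(0.0, 1.0 - delta, size=k)     # keep base + delta inside [0, 1]
    y0 = model(base)
    for j in range(k):
        moved = base.copy()
        moved[j] += delta                            # change one factor at a time
        effects[t, j] = (model(moved) - y0) / delta  # local (elementary) effect

mu_star = np.abs(effects).mean(axis=0)   # average size of the local effects
ranking = np.argsort(-mu_star)           # factor indices, most important first

print(mu_star, ranking)
```

Because the local effects are computed at points spread over the whole input space, the averaged measure is global in character even though each individual effect is local.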
Local SA
Local SA concentrates on the local impact of the factors on the model. Local SA is usually carried out by computing partial derivatives of the output functions with respect to the input variables.
The input parameters are varied in a small interval around a nominal value. The interval is usually the same for all of the variables and is not related to the degree of knowledge of the variables.
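A local analysis of this kind can be sketched with central finite differences around a nominal point. The model, the nominal values and the 1% step are hypothetical and chosen only to make the example runnable.

```python
import numpy as np

def model(x):
    """Hypothetical model output as a function of three input parameters."""
    return x[0] * np.exp(-x[1]) + x[2] ** 2

nominal = np.array([2.0, 0.5, 1.0])   # hypothetical nominal parameter values
rel_step = 0.01                       # same 1% interval for every parameter

gradient = np.empty(nominal.size)
for j in range(nominal.size):
    h = rel_step * nominal[j]
    up, down = nominal.copy(), nominal.copy()
    up[j] += h
    down[j] -= h
    # Central-difference estimate of the partial derivative dy/dx_j
    gradient[j] = (model(up) - model(down)) / (2.0 * h)

# Normalised sensitivities: relative change in output per relative change in input
y0 = model(nominal)
relative = gradient * nominal / y0

print(gradient, relative)
```

These derivatives describe the model only near the nominal point; they say nothing about behaviour elsewhere in the input space, which is the motivation for global methods.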
Global SA
Global SA apportions the output uncertainty to the uncertainty in the input factors, covering their entire range space.
A global method evaluates the effect of $x_j$ while all other $x_i$, $i \neq j$, are varied as well.
How is a sampling (global) based SA implemented?
Step 1: define model, input factors and outputs
Step 2: assign p.d.f.’s to input parameters/factors and if necessary covariance structure. DIFFICULT
Step 3: simulate realisations from the parameter pdfs to generate a set of model runs giving the set of output values.
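The three steps can be sketched with simple random sampling. The model and the input distributions are hypothetical, and the inputs are assumed independent, so no covariance structure is specified.

```python
import numpy as np

rng = np.random.default_rng(11)

# Step 1: define the model, its input factors and its output
def model(a, b):
    """Hypothetical model with two input factors."""
    return a * np.exp(b)

# Step 2: assign a p.d.f. to each input factor (assumed independent here)
n_runs = 5000
a = rng.normal(10.0, 1.0, size=n_runs)   # a ~ Normal(10, 1)
b = rng.uniform(0.0, 0.5, size=n_runs)   # b ~ Uniform(0, 0.5)

# Step 3: run the model once per sampled input set
y = model(a, b)

# Summaries of the resulting output distribution
y_mean, y_sd = y.mean(), y.std(ddof=1)
y_lo, y_hi = np.percentile(y, [2.5, 97.5])

print(y_mean, y_sd, y_lo, y_hi)
```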
Choice of sampling method
S(imple) or Stratified R(andom) S(ampling)
– Each input factor is sampled independently many times from its marginal distribution to create the set of input values (or randomly sampled from the joint distribution).
– Expensive (relatively) in computational effort if the model has many input factors; may not give good coverage of the entire range space.
L(atin) H(ypercube) S(ampling)
– The range of each input factor is divided into N equal-probability intervals, and one observation of each input factor is made in each interval.
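Latin Hypercube Sampling can be written in a few lines. The helper `latin_hypercube` below is a hypothetical name for a minimal sketch; the two uniform marginal distributions are illustrative assumptions.

```python
import numpy as np

def latin_hypercube(n, k, rng):
    """Return an (n, k) array of LHS points on the unit hypercube."""
    # One uniform draw inside each of the n equal-probability intervals
    u = (np.arange(n)[:, None] + rng.uniform(size=(n, k))) / n
    # Shuffle the interval order independently for each factor
    for j in range(k):
        u[:, j] = u[rng.permutation(n), j]
    return u

rng = np.random.default_rng(5)
n, k = 10, 2
u = latin_hypercube(n, k, rng)

# Map the unit-interval samples to the assumed marginal distributions
# (hypothetical choices: both factors uniform on a stated range)
x1 = 1.0 + 2.0 * u[:, 0]   # Uniform(1, 3)
x2 = 10.0 * u[:, 1]        # Uniform(0, 10)

# Count the points falling in each of the n intervals, per factor
counts = np.stack([np.bincount((u[:, j] * n).astype(int), minlength=n)
                   for j in range(k)])
print(counts)
```

Every interval of every factor contains exactly one point, which is what gives LHS better coverage of each factor's range than simple random sampling with the same number of runs.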
SA -analysis
At the end of the computer experiment, the data are of the form $(y_{ij}, x_{1i}, x_{2i}, \ldots, x_{ni})$, where $x_1, \ldots, x_n$ are the realisations of the input factors.
Analysis includes regression analysis (on raw and ranked values), standard hypothesis tests of distribution (mean and variance) for subsamples corresponding to given percentiles of x, and Analysis of Variance.
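Regression on raw and on ranked values can be sketched as below, using standardised regression coefficients as the sensitivity measure. The model and sample are hypothetical; the helper names are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical computer experiment: 500 runs, three input factors,
# no random error because the model is deterministic code
n = 500
x = rng.uniform(0.0, 1.0, size=(n, 3))
y = np.exp(3.0 * x[:, 0]) + 2.0 * x[:, 1]   # third factor has no effect

def standardised_coefficients(inputs, output):
    """Regression coefficients after scaling every variable to unit variance."""
    z = (inputs - inputs.mean(axis=0)) / inputs.std(axis=0)
    t = (output - output.mean()) / output.std()
    coef, *_ = np.linalg.lstsq(z, t, rcond=None)
    return coef

def ranks(a):
    """Replace values by their ranks (0 = smallest) along axis 0."""
    return np.argsort(np.argsort(a, axis=0), axis=0).astype(float)

src = standardised_coefficients(x, y)                  # raw values
srrc = standardised_coefficients(ranks(x), ranks(y))   # ranked values

print(src, srrc)
```

Working with ranks reduces the influence of non-linear but monotonic relationships, such as the exponential term in this invented model.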
Some ‘newer’ methods of analysis
Measures of importance
$\mathrm{Var}_{X_j}(E(Y \mid X_j = x_j)) / \mathrm{Var}(Y)$
$\mathrm{HIM}(X_j) = \sum_i y_i y_i' / N$
Sobol sensitivity indices
Fourier Amplitude Sensitivity Test (FAST)
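The first measure, the variance of the conditional expectation divided by the total variance, can be estimated crudely from a random sample by binning one input at a time. This is a rough sketch under the assumption of independent inputs, not the Sobol or FAST estimators themselves, and the model is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(21)

# Hypothetical model with independent Uniform(0, 1) inputs
n = 20000
x = rng.uniform(0.0, 1.0, size=(n, 2))
y = 3.0 * x[:, 0] + x[:, 1]

def first_order_index(xj, y, n_bins=20):
    """Estimate Var(E(Y | Xj)) / Var(Y) from equal-probability bins of Xj."""
    edges = np.quantile(xj, np.linspace(0.0, 1.0, n_bins + 1))
    bin_id = np.clip(np.searchsorted(edges, xj, side="right") - 1, 0, n_bins - 1)
    cond_mean = np.array([y[bin_id == b].mean() for b in range(n_bins)])
    weight = np.bincount(bin_id, minlength=n_bins) / y.size
    var_cond_mean = np.sum(weight * (cond_mean - y.mean()) ** 2)
    return var_cond_mean / y.var()

s1 = first_order_index(x[:, 0], y)
s2 = first_order_index(x[:, 1], y)

print(s1, s2)
```

For this additive model the exact values are 0.9 and 0.1, so the two indices sum to one; with interactions between factors the first-order indices would sum to less than one.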
How can SA/UA help?
SA/UA have a role to play in all modelling stages:
– We learn about model behaviour and ‘robustness’ to change;
– We can generate an envelope of ‘outcomes’ and see whether the observations fall within the envelope;
– We can ‘tune’ the model and identify reasons/causes for differences between model and observations
On the other hand - Uncertainty analysis
Parameter uncertainty
– usually quantified in the form of a distribution.
Model structural uncertainty
– more than one model may be fitted; expressed as a prior on model structure.
Scenario uncertainty
– uncertainty about future conditions.
Tools for handling uncertainty
Parameter uncertainty
– probability distributions and sensitivity analysis
Structural uncertainty
– Bayesian framework: one possibility is to define a discrete set of models; another is to use a Gaussian process
– model averaging
An uncertainty example (1)
Wet deposition is rainfall × ion concentration
Rainfall is measured at approximately 4000 locations, map produced by UK Met Office.
Rain ion concentrations are measured weekly (now fortnightly or monthly) at around 32 locations.
An uncertainty example (2)
BUT
• almost all measurements are at low altitudes
• much of Britain is upland
AND measurement campaigns show
• rain increases with altitude
• rain ion concentrations increase with altitude
Seeder rain, falling through feeder rain on hills, scavenges cloud droplets with high pollutant concentrations.
An uncertainty example (3)
Solutions:
(a) More measurements
    X: measurements at high altitude are not routine and are complicated
(b) Derive relationship with altitude
    X: rain shadow and wind drift (over about 10 km downwind) confound any direct altitude relationships
(c) Derive relationship from rainfall map
    model rainfall in 2 separate components
An uncertainty example (4)
An uncertainty example (5)
Wet deposition is modelled by
deposition = s.c + (r-s).c.f
where
r  actual rainfall
s  rainfall on ‘low’ ground (r = s on ‘low’ ground, and (r-s) is excess rainfall caused by the hill)
c  rain ion concentration as measured on ‘low’ ground
f  enhancement factor (ratio of rain ion concentration in excess rainfall to rain ion concentration in ‘low’ ground rainfall)
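The deposition model is simple enough to write as a single function. The function name `wet_deposition` and the input values are hypothetical; only the formula itself comes from the slide.

```python
def wet_deposition(r, s, c, f):
    """deposition = s.c + (r - s).c.f

    r: actual rainfall; s: rainfall on 'low' ground;
    c: rain ion concentration on 'low' ground; f: enhancement factor.
    """
    return s * c + (r - s) * c * f

# Hypothetical values, in arbitrary consistent units
low_ground = wet_deposition(r=1000.0, s=1000.0, c=0.5, f=2.0)   # r = s: no excess rainfall
upland = wet_deposition(r=1500.0, s=1000.0, c=0.5, f=2.0)       # 500 units of excess rainfall

print(low_ground, upland)   # 500.0 1000.0
```

On low ground the second term vanishes and deposition reduces to rainfall times concentration; on a hill the excess rainfall is weighted by the enhancement factor.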
An uncertainty example (6)
[Figure: panels labelled Rainfall, Concentration and Deposition.]
An uncertainty example (7)
a) modelled rainfall to 5 km squares provided by UKMO, with unknown uncertainty
   scale issue: rainfall is a point measurement
   measurement issue: rain gauges are difficult to use at high altitude
   optimistic 30%, pessimistic 50%
   how is the uncertainty represented? (not, e.g., 30% everywhere)
An uncertainty example (8)
b) some sort of smoothed surface (change in prevalence of westerly winds means it alters between years)
c) kriged interpolation of annual rainfall weighted mean concentrations (variogram not well specified); assume 90% of observations within ±10% of correct value
d) campaign measurements indicate values between 1.5 and 3.5
An uncertainty example (9)
Output measures in the sensitivity analysis are the average flux (kg S ha⁻¹ y⁻¹) for
(a) GB, and
(b) 3 sample areas
An uncertainty example (10)
Morris indices are one way of determining which effects are more important than others, so reducing further work.
but different parameters are important in different areas
An uncertainty example (11)
100 simulations
Latin Hypercube Sampling of 3 uncertainty factors:
• enhancement ratio
• % error in rainfall map
• % error in concentration
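A simulation of this kind can be sketched for a single grid square. Everything numeric here is an illustrative assumption: the baseline rainfall and concentration are invented, the three factors are treated as independent and uniform over the stated ranges, and the same relative error is applied to both rainfall components. It is not the analysis that produced the results in the presentation.

```python
import numpy as np

rng = np.random.default_rng(100)
n_sim = 100

def lhs_column(n, rng):
    """One Latin Hypercube column: a shuffled draw from each of n equal intervals."""
    return rng.permutation((np.arange(n) + rng.uniform(size=n)) / n)

# Three uncertain factors, one LHS column each (ranges are illustrative)
f = 1.5 + 2.0 * lhs_column(n_sim, rng)                   # enhancement ratio in [1.5, 3.5]
rain_err = 0.30 * (2.0 * lhs_column(n_sim, rng) - 1.0)   # rainfall error within +/-30%
conc_err = 0.10 * (2.0 * lhs_column(n_sim, rng) - 1.0)   # concentration error within +/-10%

# Hypothetical single upland grid square, arbitrary consistent units
r0, s0, c0 = 1500.0, 1000.0, 0.5

r = r0 * (1.0 + rain_err)
s = s0 * (1.0 + rain_err)   # assumption: same relative error on r and s
c = c0 * (1.0 + conc_err)
deposition = s * c + (r - s) * c * f

dep_mean = deposition.mean()
dep_cv = 100.0 * deposition.std(ddof=1) / dep_mean   # CV (%) over the simulations

print(dep_mean, dep_cv)
```

Repeating the calculation for every grid square would give maps of the mean, standard deviation and CV of the simulated deposition.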
An uncertainty example (12)
Note skewed distributions for GB and for the 3 selected areas
An uncertainty example (13)
[Figure: panels labelled Original, Mean of 100 simulations, and Standard deviation.]
An uncertainty example (14)
CV from 100 simulations
Possible bias from 100 simulations
An uncertainty example (15)
• model sensitivity analysis identifies weak areas
• lack of knowledge of accuracy of inputs is a significant problem
• there may be biases in the model output which, although probably small in this case, may be important for critical loads
Conclusions
• The world is rich and varied in its complexity.
• Modelling is an uncertain activity.
• SA/UA are important tools in model assessment.
• Setting the problem in a unified Bayesian framework allows all the sources of uncertainty to be quantified, so that a fuller assessment can be performed.
Challenges
Some challenges:
• different terminologies in different subject areas.
• need for more sophisticated tools to deal with the multivariate nature of the problem.
• challenges in describing the distribution of input parameters.
• challenges in dealing with the Bayesian formulation of structural uncertainty for complex models.
• computational challenges in simulations for large and complex computer models with many factors.